Densities for u and x (under the assumption of Gaussian marginals) and 500 samples.

Simulated returns and stock prices for 100 (left) and 10,000 samples (right). White circles are
theoretical quantiles and black ones are their empirical counterparts. Theoretical and empirical 
means are white and black stars, respectively. 

Returns for portfolios consisting of N equally weighted assets (“1-over-N”). 

Histograms for the value of a stock (top panel) and a portfolio consisting of the stock plus one 
European put (bottom panel) at times T. 

Left panel: Value-at-Risk for stock (dotted lines) and portfolio (solid line) for 1%, 5%, and 10% 
quantiles (light to dark). Right panel: 5%, 25%, 50%, 75%, and 95% percentiles of log returns for 
stock (dotted) and portfolio (solid line). 

Returns and prices for different AR, MA, and ARMA models. 

Scaled white noise; dark lines indicate $\pm\sigma_t$.


Daily returns (gray lines) and $\mu \pm 2\sigma_t$ confidence bands (dark lines) (left panel) for different
combinations of $\alpha_1 + \beta_1 = 0.95$ and prices (right panel) for GARCH(1, 1) processes.

Actual daily log returns for the FTSE and S&P 500 for the period Jan 2000 to Dec 2009 and $\pm 2\sqrt{h_t}$
(top panel), and innovations standardized using time-varying volatility, $\sqrt{h_t}$, as fitted with a
GARCH(1, 1) (bottom panel).

Actual FTSE daily log returns for July 2004 to June 2010 (left) and two simulations based on a 
GARCH model fitted on actual data (center and right; 2525 observations each; seeds fixed with 
randn('seed',10) and randn('seed',50), respectively).


Kernel densities for cumulated returns (250 days; left panel) and prices (right panel) where daily 
returns follow a GARCH(1, 1) process with $\alpha_1 \in \{0.05, 0.2, 0.35, 0.95\}$ and $\beta_1 = 0.95 - \alpha_1$ (darker
lines indicate lower $\alpha_1$).

Stock prices under rational expectations from the Timmermann (1993) model ($n = 25$, $\mu = 0.01$,
$\sigma = 0.05$). Rational expectation prices (thin gray line) and fundamental prices with known
parameters (thick gray line).

Bootstrap samples for FTSE data. 

Moments of a buy-and-hold portfolio (thick gray lines) and its constituents (thin black lines; FTSE, 
DAX, and EuroStoxx) over different lengths of investment horizons (x-axis: T = 1, 5, 21, 62 days;
corresponding to 1 day, 1 week, 1 month, 1 quarter); based on original data August 2005 to July
2010 and 1,000,000 bootstraps. Top panel, block length b = 1; center panel, block length b = 5; and 
bottom panel, block length b = T. 

Price changes (top panel) and fraction of fundamentalists in the market (bottom panel) from an 
agent-based simulation (Algorithm 27). 

CPPI composition for two sample price processes and different multipliers; time to maturity of 2
years and daily readjustment (“no gap risk”) and quarterly readjustment (rightmost; “with gap risk”).
Light gray: safe asset, $B_t$; dark gray: exposure, $E_t$; white line: floor, $F_t$.

FTSE time series, 10-block bootstraps for risky asset and trajectories for CPPI. 

Terminal value of CPPI with FTSE as T = 1 year, multiplier m = 4, and different gap lengths, 
simulated with block bootstrap (right). 
Median of ratios VaR!!! / VaRthe° for different underlying distributions (panels) and different
fractions of data used, m/N (lines). 

20 paths of geometric Brownian motion. 

Five paths of a geometric Brownian bridge, starting at 100 and ending at 103. 

Speed of convergence for Monte Carlo pricing. $S_0 = 50$, $X = 50$, $r = 0.05$, $\sqrt{v} = 0.30$, and $t = 1$.
The true Black-Scholes price is 7.12. 

Forward difference for Greeks: Boxplots for Delta estimates with M = 1 and N = 100,000 for 
different values of h (y-axis). Parameters are $S = 100$, $X = 100$, $t = 1$, $r = 0.03$, $q = 0$, and $\sigma = 0.2$.
The solid horizontal line is the true BS Delta of 0.60; the dotted lines give Delta $\pm 0.05$.

Forward difference for Greeks: Boxplots for Delta estimates with M = 1 and N = 100,000 for 
different values of h (y-axis), but using common random variates. Parameters are $S = 100$, $X = 100$,
$t = 1$, $r = 0.03$, $q = 0$, and $\sigma = 0.2$. The solid horizontal line is the true BS Delta of 0.60; the dotted
lines give Delta $\pm 0.05$.

Top left: scatter of 500 against 500 points generated with MATLAB’s rand. Top right: scatter of 500 
against 500 generated with Halton sequences (bases 2 and 7). Bottom left: scatter of 500 against 500 
generated with Halton sequences (bases 23 and 29). Bottom right: scatter of 500 against 500 
generated with Halton sequences (bases 59 and 89). 

Convergence of price with quasi-MC (function callBSMQMC). The light gray lines show the
convergence with standard MC approach. 

Three paths generated with bases 3, 13, and 31. 

Objective function for Value-at-Risk. 


LMS objective function. 

Simulated objective function for Kirman’s model for two parameters. 

Left panel: Heston model objective function. Right panel: Nelson-Siegel-Svensson model objective
function. 

Synoptic view of methods presented in the chapter. 

Left panel: iteration function $g_1$. Right panel: iteration function $g_2$.

Left panel: iteration function $g_3$. Right panel: iteration function $g_4$.

Left panel: iteration function satisfies $g'(x) < -1$. Right panel: iteration function satisfies
$-1 < g'(x) < 0$.

Shape of function px (left panel) and shape of function px(s) ¢(s) (right panel). 

Behavior of the Newton method for different starting values. Upper left: x0 = 2.750 and
$x^{\mathrm{sol}} = 2.4712$. Upper right: x0 = 0.805 and $x^{\mathrm{sol}} = 2.4712$. Lower left: x0 = 0.863 and
$x^{\mathrm{sol}} = 1.4512$. Lower right: x0 = 1.915 and algorithm diverges.

Minimization of $f(x_1, x_2) = \exp\left(0.1\,(x_2 - x_1^2)^2 + 0.05\,(1 - x_1)^2\right)$ with the steepest descent
method. Right panel: minimization of $\alpha$ for the first step ($\alpha^* = 5.87$).

Minimization of $f(x_1, x_2) = \exp\left(0.1\,(x_2 - x_1^2)^2 + 0.05\,(1 - x_1)^2\right)$ with Newton’s method. Contour
plots of the local model for the first three steps.

Nonconvex local model for starting point (left panel) and convex local model but divergence of the 
step (right panel). 

Comparison of the first seven Newton iterations (circles) with the first five BFGS iterations (triangles).

Evolution of starting simplex in the Nelder-Mead algorithm.

Detailed starting simplex and two reflections. 

Left panel: Plot of observations, model for $x_1 = 2.5$, $x_2 = 2.5$ and residuals. Right panel: plot of sum
of squared residuals g(x). 

Starting point (bullet), one step (circle) and contour plots of local models for Newton (upper right), 
Gauss-Newton (lower left) and Levenberg-Marquardt (lower right).

Solutions for varying values of parameter c of the systems of two nonlinear equations. 

Steps of the fixed point algorithm for the starting solution $y^{(0)} = [1\ 5]'$.

Oscillatory behavior of the fixed point algorithm for the starting solution $y^{(0)} = [2\ 0]'$.

Steps of the Newton algorithm for starting solution $y^{(0)} = [1\ 5]$.

Steps of the Broyden algorithm for starting point $y^{(0)} = [5/2\ 1]$ and identity matrix for $B^{(0)}$.

Steps of the Broyden algorithm for starting point $y^{(0)} = [5/2\ 1]$ and Jacobian matrix evaluated at the
starting point for $B^{(0)}$.

Objective function minimizing ||F (y)||2 for the solution of the system of nonlinear equations defined 
in Example 11.4. 

Synoptic view of solution methods for systems of linear and nonlinear equations. 

Scheme for possible hybridizations. 

Shekel function. For better visibility $-f$ has been plotted.

Shekel function. Empirical distribution of solutions. 

Shekel function. Left panel: Threshold sequence. Right panel: Objective function values over the 6 
rounds of the 5th restart. 
Subset sum problem. Left panel: Threshold sequence. Right panel: Objective function values over
the 6 rounds of the 5th restart.


Subset sum problem: Evolution of the fittest solution in restart 3. The algorithm stops at generation 6. 


Differential Evolution optimization. Left panel: Empirical distribution of solutions. Right panel: Best 
value over generations of restart 3. 

Particle Swarm optimization. Left panel: Empirical distribution of solutions. Right panel: Best value 
over generations of restart 1. 

Code DCex.m. Theoretical speedup and efficiency as a function of processors. Dots represent 
computed values. 

The upper panel shows the left tail of the distribution of objective function values for a sample of 
random solutions. We sample uniformly, and thus the distribution function resembles a straight line. 
The lower panel shows the distribution for the best 100 of the one million random solutions.

Distributions of objective function values of best 100 random solutions and 100 greedy solutions.
Best possible solution would be zero.

Objective function values of an unguided (random) walk through the search space. 

Distributions of changes in objective function values when 1 (dark gray), 5 (gray), and 10 (light gray) 
elements are changed. 

No structure in the objective function. In search spaces such as the ones pictured a Local Search will 
fail, because the objective function has no local structure, i.e. it provides no guidance. Being close to 
a good solution cannot be exploited by the algorithm. 

Objective function value for two runs of Local Search over 50,000 iterations (log scale).

Distributions of objective function values of best 100 random solutions, 100 greedy solutions, 100
solutions obtained by Local Search and 100 solutions obtained by Threshold Accepting. Best 
possible solution would be zero. 

True frontier and estimated and realized frontiers (no short sales, na is 25, ng is 100). 

Portfolio return distributions: the influence of rg. 

In-sample versus out-of-sample differences. Left: in-sample difference of a solution pair (QP minus 
TA) against the associated out-of-sample difference. Right: same, but only for portfolios for which 
the in-sample difference is less than one basis point. 

In-sample versus out-of-sample difference depending on the number of iterations. 

Squared returns (left panel) and ratio of conditional moments (right panel) for different portfolio 
weights. 

Distributions of objective function values for random portfolios (light gray) and solutions of TAopt 
and LSopt (darker gray). There is little difference between Local Search and Threshold Accepting.

Distributions of objective function values with 500, 1000 and 2000 iterations. The lighter-gray lines
belong to TAopt, the darker-gray ones to LSopt. There is virtually no difference between the lines 
for Local Search and Threshold Accepting with 2000 iterations; both methods converge quickly on 
the same portfolio. 

Distribution of end-of-horizon wealth in a VaR setting. 

Daily returns of the S&P 500, including October 1987. 

S&P 500 without 10 best and 10 worst days. 

A random series created with function randomPriceSeries. 

Distribution of difference in final profits MA-crossover strategy vs buy-and-hold. A positive number 
means that MA-crossover outperformed the underlying asset. The distribution is symmetric about 
zero, which indicates that there is no systematic advantage to the crossover strategy. 

Final profits of MA-crossover strategy vs result of buy-and-hold of the same underlying series.

Equity curves of optimized backtests.

Distribution of in-sample excess profits for optimized MA crossover strategy. A positive number 
means that MA crossover outperformed the underlying asset. 

Observational Equivalence: EURO STOXX Total Market and EURO STOXX Banks. 

Mechanics of a walk-forward. 

Survivorship bias when using current components of an index 

S&P stock market index since January 1871 

CAPE ratio 

Results of avoiding high valuation. The S&P index is shown in black; the strategy in gray. 

Results of avoiding high valuation compared with S&P 500. 

Performance of strategy for different values of q. The S&P index is shown in black; the strategy 
variations in gray. 

Performance of industry portfolios as provided by Kenneth French’s data library, January 1990 to 
May 2018. Time-series are computed from daily returns. 

Fanplot of the industry time-series shown in Fig. 15.16. 

Correlations of monthly returns. 
Benchmarks: Performance of market portfolio and of equally weighted portfolio. The gray shades
are the same as in Fig. 15.17 and indicate the range of performance of the different industries.

Benchmarks: Outperformance of market versus equally weighted portfolio. A rising line indicates
the outperformance of the market portfolio, and vice versa.

Performance of momentum strategy. The vertical axis uses a log scale. 

Performance of momentum strategy. The vertical axis uses a log scale. The gray line shows the 
absolute performance of the market. The black line shows the relative performance of momentum 
compared with the market, i.e. a rising line indicates an outperformance of momentum. 

Results from sensitivity check: Fanplots of paths with frequent rebalancing (the upper band) and less 
frequent rebalancing (the lower band). 

Results from sensitivity check: The ratio of the median paths of the fanplots in Fig. 15.23. The 
steadily rising line indicates that frequent rebalancing outperforms less frequent rebalancing.

Results from sensitivity checks: densities of annualized returns of the fanplots in Fig. 15.23. The
distribution to the right belongs to the paths with more frequent rebalancing. 

Performance of the long-only minimum-variance portfolio (gray) vs market (black). On Nov 19, 
1999, both series have a value of 100 (the MV time-series starts later because of its burn-in of
10 years).

Performance of 100 walk-forwards with random rebalancing periods. The portfolios are computed 
with QP. All randomness in the results comes through differences in the setup; there is no numeric 
randomness. 

Correlation of daily returns when portfolios are rebalanced at random timestamps. 

Positions in a single industry across four backtest variations. The backtests differ only in their 
rebalancing schedules; hence the positions are very similar. 

Performance of 100 walk-forwards with fixed rebalancing periods. The portfolios are computed with 
Local Search. All variations in the results stem from the randomness inherent in Local Search 
(though it is barely visible); there are no other elements of chance in the model. Compare with
Fig. 15.27.

Interpolation and approximation. The panels show a set of market rates. Left: we interpolate the 
points. Right: we approximate the points. 

The matrix C for the cash flows of the 44 bonds in bundData; each square represents a nonzero
entry. Each row gives the cash flows of one bond; each column is associated with one payment date.

Level. The left panel shows $y(t) = \beta_1 = 3$. The right panel shows the corresponding yield curve, in
this case also $y(t) = \beta_1 = 3$. The influence of $\beta_1$ is constant for all $t$.

Short-end shift. The left panel shows $y(t) = \beta_2 \left[\frac{1 - \exp(-t/\lambda_1)}{t/\lambda_1}\right]$ for $\beta_2 = -2$. The right panel shows
the yield curve resulting from the effects of $\beta_1$ and $\beta_2$, that is, $y(t) = \beta_1 + \beta_2 \left[\frac{1 - \exp(-t/\lambda_1)}{t/\lambda_1}\right]$ for
$\beta_1 = 3$, $\beta_2 = -2$. The short end is shifted down by 2%, but then the curve grows back to the long-run
level of 3%.

Hump. The left panel shows $\beta_3 \left[\frac{1 - \exp(-t/\lambda_1)}{t/\lambda_1} - \exp(-t/\lambda_1)\right]$ for $\beta_3 = 6$. The right panel shows the
yield curve resulting from all three components. In all panels, $\lambda_1$ is 2.


Nelson-Siegel-Svensson: The three panels show the correlation between the second and the third,
the second and the fourth, and the third and the fourth regressors in Eqs. (16.10) for different
$\lambda$-values (the x- and y-axes show $\lambda_1$ and $\lambda_2$ between 0 and 25).

Distributions of estimated parameters. The true parameters are 4, −2, and 2.

Negative interest rates with Nelson-Siegel model despite parameter restrictions.

Results for Case 2, Nelson-Siegel (NS) model: True model yields, yield curves fitted with DEopt,
and yield curves fitted with nlminb. Plotted are the results of 5 restarts for both methods, though 
for DEopt it is impossible to distinguish between the different curves. 


Results for Case 2, Nelson-Siegel-Svensson (NSS) model: True model yields, yield curves fitted
with DEopt, and yield curves fitted with nlminb. Plotted are the results of 5 restarts for both
methods, though for DEopt it is impossible to distinguish between the different curves. Compare
with Fig. 16.9: if anything, the results for nlminb are worse.


Results for Case 3, Nelson-Siegel-Svensson (NSS) model: True model yields, yield curves fitted
with DEopt, and yield curves fitted with nlminb. Plotted are the results of 5 restarts for both 
methods, though for DEopt it is impossible to distinguish between the different curves. 

Results for Case 4, Nelson-Siegel-Svensson (NSS) model: True model yields, yield curve fitted with
DEopt, and yield curve fitted with nlminb. Plotted are the results of 5 restarts for both methods, 
though for DEopt it is impossible to distinguish between the different curves. 
Left: Convergence of solutions. Each distribution function shows the objective function value of the
best population member from 1000 runs of DEopt. As the number of generations increases, the 
distributions become steeper and move to the left (zero is the optimal objective function value). 
Right: Convergence of population, i.e. objective function values across members at the end of 
DEopt run. The light-gray distributions are from runs with 100 generations; the darker (and steeper) 
ones come from 500 generations. Those latter distributions typically converge: all solutions in the 
population are the same. 

OF over time. The left panel shows F = 0.5, CR = 0.99. For the same problem, the right panel shows
F = 0.9, CR = 0.5.

The true value of $\beta_1$ is 5. Left: without constraint (penalty weight is zero). Right: with constraint
$\beta_1 < 4$.

The effects of a single outlier. On 6 June 2006, Adidas, a large German producer of clothing and 
shoes, made a stock split 4-for-1. The data was obtained from www.yahoo.com in April 2009. 
According to www.yahoo.com the series in the upper panel is split adjusted. It is ironic that this is 
even true: but the price was adjusted twice, which resulted in the price jump in June 2006 (topmost 
panel). 

Objective function values obtained with lqs with increasing number of samples (see code example).

True $\beta_2$ is zero. The outliers lead to a biased value of 0.9.

True $\beta_2$ is zero. The outliers lead to a biased value of 1.5.

Reported parameters for the GARCH(1, 1) estimates from different restarts, all with identical 
(optimal) loglikelihood of −1106.60788104129.

Implied volatilities for options on the S&P 500 (left) and DAX (right) as of 28 October 2010.

Heston model: re-creating the implied volatility surface. The graphics show the BS-implied
volatilities obtained from prices under the Heston model. The panels on the right show the implied 
volatility of ATM options. 

Bates model: re-creating the implied volatility surface. The graphics show the BS-implied volatilities 
obtained from prices under the Bates model. The panels on the right show the implied volatility of 
ATM options. 

Merton model: re-creating the implied volatility surface. The graphics show the BS-implied
volatilities obtained from prices under the Merton model. The panels on the right show the implied
volatility of ATM options. Importantly, jumps need to be volatile to get a smile (i.e., $v_J > 30\%$).

Relative (left) and absolute (right; in cents) price errors for BS with direct integration with 25 nodes
(compared with analytical solution). 

Relative (left) and absolute (right; in cents) price errors for BS with direct integration with 50 nodes 
(compared with analytical solution). 

Operations of Nelder-Mead.

Absolute errors for Gauss-Legendre and Gauss-Laguerre compared with normcdf for an
increasing number of nodes. Errors are plotted on the y axis; the x axis shows the number of nodes. 
Foreword to the second edition


Whenever a second edition of a book is written and published, the authors are expected to summarize
what is new, and why this new edition is better than the first. Start with what has not changed:
the spirit of the book. That spirit was and is that i) computational methods are powerful tools in 
finance, but ii) these methods must be applied with care and thought. Numbers may be precise, but 
what ultimately matters is the things that are being done with these numbers. 

What has changed in the financial world since the first edition? The most glaring changes have
been brought about by the financial crises of 2007-08 and the crisis of the euro area, the latter still
ongoing at the time of writing. One consequence was that quant approaches fell into disrepute, 
though this dent in popularity was short-lived. Deservedly so: the crisis did not happen because of 
models. That being said, the methods described in the first edition remain valid, though sometimes 
minor updates were necessary. For instance, interest rates could become negative after all. 

A more lasting change has been the growth in regulation. Unless you are into compliance, 
work has become less fun. But not all is bad: one potentially positive effect is the increase in 
transparency and available data. (We say potentially because it is too early to tell if it really has the 
desired effect.) By this we mean that more trading is pushed to exchanges; and more data are being 
collected, notably in areas such as bond trading and OTC derivatives, which may become available 
for research, e.g. into market microstructure. 

The financial crisis has had effects on consumer behavior too. Or rather, consumer perception. 
Before the crisis, people thought that technology firms posing as banks could not be trusted: they 
might be frauds, there might be security problems, and who knew whether a firm would be around 
a year later? And to be sure: all these risks were and still are real. What the financial crises showed, 
however, was that such risks are real for traditional banks, too. As a result, financial technology 
became more widely accepted. 

Of course, many technologies that are branded FinTech, TechFin, RoboAdvice, or whatever today,
are just hype. But such technologies do offer the genuine possibility for better finance, with
lower costs, fewer middlemen, and more efficiency. Here, the simulation and optimization techniques
we discuss in Parts II and III of the book gain even more relevance. Let us provide a specific
example: financial portfolios are often represented as weights instead of actual (integer) positions, 
because that makes them more tractable for classical optimization techniques. This is often declared 
harmless because for institutional investors, who run large portfolios, the differences between using 
decimal numbers and integers will be small.¹ But that does not hold for an investor whose portfolio
is of only moderate size of 20,000 dollars, say (Maringer, 2005b, Chapters 3 and 4). But such
investors are exactly those that should benefit from the new financial technologies. 
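
The effect of integer positions on a small portfolio can be made concrete with a few lines of code. The following Python sketch is an illustration only: the prices, the target weights, and the round-down rule are invented assumptions, not data or methods from the book. It converts target weights into whole-share positions and reports the largest gap between target and realized weights, once for a portfolio of 20,000 dollars and once for one of 20 million dollars.

```python
# Hypothetical illustration: prices and target weights are made up.

def realized_weights(wealth, prices, weights):
    """Turn target weights into whole-share positions (rounding down)
    and return the positions and the weights actually achieved."""
    positions = [int(wealth * w // p) for w, p in zip(weights, prices)]
    realized = [n * p / wealth for n, p in zip(positions, prices)]
    return positions, realized


def max_deviation(wealth, prices, weights):
    """Largest absolute gap between target and realized weights."""
    _, realized = realized_weights(wealth, prices, weights)
    return max(abs(r - w) for r, w in zip(realized, weights))


prices = [3200.0, 450.0, 95.0]   # assumed prices per share
weights = [0.5, 0.3, 0.2]        # assumed target weights

small = max_deviation(20_000, prices, weights)       # moderate-size investor
large = max_deviation(20_000_000, prices, weights)   # institutional investor

print(f"max deviation, small portfolio: {small:.4%}")
print(f"max deviation, large portfolio: {large:.4%}")
```

With these made-up numbers, the small portfolio misses one of its target weights by two percentage points, while the deviations for the large portfolio are negligible.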

Since we are on the subject of technology, what else has changed? Computers have become even faster, and
even more data are stored, often directly at technology firms large and small (these companies 
also do much more research). And artificial intelligence and machine learning are seeing another 
spring. Here again, Parts II and III are directly applicable, in particular the optimization techniques 
we discuss: after all, machine learning is in essence setting up a model and solving it (Goodfellow 
et al., 2016, Chapter 5). 


1. We should stress that such a claim could only be empirically verified by using such methods as we describe in Part III of 
the book. 


So let us come back to the book. As we said, we did not touch the spirit of the book: computational
methods are powerful tools, and they must be applied with care and thought. But we have
added quite a bit of new material: 


• There is a new tutorial chapter on using heuristic optimization.

• There is a new chapter on backtesting investment strategies.

• The NMOF package has substantially expanded since the first edition: there are now functions
for genetic algorithms, grid search, option pricing and much more; many of these functions are
described in the new edition of the book.

• The chapter on portfolio optimization has been expanded, and several code examples have been
improved.

• Material on parallel computing with both MATLAB® and R has been added.

• Material on solving linear systems with R has been added.

• Many of the existing R examples have been rewritten with Sweave (Leisch, 2002), making them
completely reproducible. The new chapters are written with Sweave too.


About R code 


Most R functions that were described in the first edition are included in the NMOF package, which 
has grown quite a bit over the years. Nevertheless, all R code examples from the first edition still 
work with the current version of R and the package (though several examples have been improved 
in this edition). 

Most of the old R code and all newly-added R code has now been prepared with Sweave; so 
code for graphics, tables, etc. is directly embedded in the source document. That does not mean 
that all code is always shown in the book: it would be tiresome having to read the same code 
for plotting a result, say, over and over again. However, the complete code is tangled and can be 
accessed on the book’s website http://www.nmof.info. It is also contained in the package (see the 
function showExample). For those parts that use Sweave, all code for a single chapter is collected
in one R source file. To make it easier to navigate these files, many code chunks have names, which 
are printed in the margin, as in the following snippet. 


> 1 + 1


In the R file, you may then look for the chunk’s name: 


###################################################
### code chunk number 1: one-plus-one
###################################################
1 + 1


When functions are shown whose code is included in the NMOF package, we typically have 
the code directly taken from the package. For this, the package was built and installed with 
--with-keep.source. When errors are intended, they are wrapped in try.


Part I


Fundamentals 




Chapter 1 


Introduction 


Contents 

1.1 About this book
    The growth of computing power
    Computational finance
1.2 Principles
1.3 On software
1.4 On approximations and accuracy
1.5 Summary: the theme of the book


1.1 About this book 


“I think there is a world market for maybe five computers.” So said, allegedly, Thomas J. Watson, 
then chairman of IBM, in 1943. It would take another ten years, until 1953, before IBM delivered 
its first commercial electronic computer, the 701. 

Stories and quotations like this abound: people make predictions about computing technology 
that turn out to be spectacularly wrong.¹ To be fair, Watson probably never made that statement;
most such infamous forecasts are either made up or taken out of context.

But in any case, Watson’s alleged statement reflects the spirit of the time, and reading it today 
plainly shows the extent to which computing power has become available: nowadays in everyone’s 
pocket, there are devices that perform millions of times faster than IBM’s 701. 


The growth of computing power 


Many examples have been given to illustrate this growth;² let us provide one that starts in the Swiss
city of Geneva, where this book originated. In case you do not know Switzerland, Fig. 1.1 shows a 
map. 

Imagine you stand at the wonderful lakefront of Geneva, with a view on Mont Blanc. You decide 
to take a run to Lake Zurich. We admit this is unlikely, given the distance of 250 km. But it is only 
an example, so please bear with us. Assume you could run the whole distance with a speed of 20 km 
per hour, which is about the speed that a world-class marathon runner could manage. (We ignore 
the fact that Switzerland is a rather mountainous country.) It would take you more than 12 hours 
to arrive in Zurich. Now, instead of running, suppose you took a plane with an average speed of 
800 km per hour. This is about 40 times faster, and you would arrive in less than half an hour. 
What is the point of the example? For one, few people would run from Geneva to Zurich. But more 
importantly, we wanted to give you an idea of speedup. It is just 40 in this case, not a large number, 
but it makes a huge difference, and it makes things possible which otherwise would not be. 

This book is not about traveling. It is about computational methods in finance. Quantitative 
methods and their implementation in software form have become ever more important in scientific 
research and the industry over the last decades, and much has changed. On one hand, ever more 
data is collected and stored today, waiting to be analyzed. At the same time, computing power has 
increased beyond imagination. If we measure the speed of a computer by the number of operations 
it can perform in a second, then computers have improved by a factor of perhaps 1,500,000 since 


1. This applies as well to predictions about technology and to predictions in general. See, for instance, predictions on TV at 
http://www.elon.edu/e-web/predictions/150/1930.xhtml.

2. One of our favorites is from Dongarra et al. (1998): “Indeed, if cars had made equal progress [as microprocessors], you 
could buy a car for a few dollars, drive it across the country in a few minutes, and ‘park’ the car in your pocket!” If applied 
to computing power in general, that is an understatement: it would now take fractions of a cent to buy a car, less than a 
second to cross the country — and you might need magnifying glasses to even find your car. 
FIGURE 1.1 A map of the beautiful country called Switzerland.


the early 1980s.³ If traveling had matched this speedup, we could go from Geneva to Zurich in 3/100
of a second. Better yet, we are not talking about supercomputers, but about the kind of computers 
we have on our desktops. And in any case, in the past, the power of supercomputers at one point in 
time was available for private use just a few years later. 
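
The arithmetic behind the running-versus-flying example, and behind the 3/100 of a second quoted above, can be checked with a short script. This is a minimal Python sketch; the variable names are our own, and the figures (250 km, 20 and 800 km per hour, a speedup factor of 1,500,000) are the ones used in the text.

```python
# Figures taken from the text; names are chosen for this illustration only.
DISTANCE_KM = 250.0           # Geneva to Lake Zurich
RUN_KMH = 20.0                # world-class marathon pace
PLANE_KMH = 800.0             # average speed of a plane
COMPUTER_SPEEDUP = 1_500_000  # rough speedup of computers since the early 1980s


def travel_hours(distance_km, speed_kmh):
    """Travel time in hours at constant speed."""
    return distance_km / speed_kmh


run_h = travel_hours(DISTANCE_KM, RUN_KMH)      # running: more than 12 hours
plane_h = travel_hours(DISTANCE_KM, PLANE_KMH)  # flying: less than half an hour
plane_speedup = run_h / plane_h                 # speedup of plane over runner

# Running time in seconds, divided by the computing speedup factor.
computer_s = run_h * 3600 / COMPUTER_SPEEDUP

print(f"running: {run_h} h, flying: {plane_h} h, speedup: {plane_speedup}")
print(f"with the computing speedup: {computer_s:.2f} s")
```

The point of the comparison is the order of magnitude: a factor of 40 turns half a day into less than half an hour, while a factor of 1,500,000 turns it into a few hundredths of a second.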

Of course, it is not only hardware that has improved, but software, too. If people still operated on 
terminals or had to use punch-cards for storing programs, computing would be much less powerful 
today. This book is about using this massively-increased computing power in finance. Indeed, this 
evolution, which took place in roughly a man’s work-life, has led to several consequences. 


Less division-of-labor. Because computers have become so fast and easy to use, people in many 
disciplines can turn from specialists into polymaths. Or, to put it differently, they become 
specialists in more-broadly-defined fields. We see this happening not only in finance; but 
also in statistics and data analysis, where often a single person prepares, processes, and analyzes
data, estimates models, runs simulations, and more (much helped by the fact that many
such tasks can be automated). Or think of publishing. Modern computers and software have 
enabled people to not only write papers and books, but to actually produce them, i.e., create 
artwork or graphics, define the layout and so on.⁴

Portable software. Software and computing models that rely on specific hardware architectures 
lose their appeal. For instance, parallel computations that exploit specific communication 
channels in hardware have become less attractive. Instead of spending the next year
rewriting their programs for the latest supercomputer architecture, people can and should 
now use their time to think about their applications and write useful software. (Then, after a 
year, they can buy a better, faster machine.) 

Interpreted languages. In the past, implementing algorithms often meant creating prototypes in a 
higher-level language and then rewriting such prototypes in low-level languages such as C. 
Today, prototypes written in languages such as MATLAB®, R, Python, Julia or Lua are so 
fast that a reimplementation is rarely needed. As a consequence, implementation times have
decreased, and we can explore new ideas and adapt existing programs much faster.


Computational finance 


Computational finance is not a well-defined discipline. It is the intersection of financial economics, 
scientific computing, econometrics, software engineering, and many other fields. Its goal is to bet- 
ter understand prices and markets—if you are an academic—, and to make money—if you are 


3. The first IBM personal computer in the early 1980s did about 10 or 20 KFLOPS (FLOPS stands for floating point operations
per second, see page 29). And the speedup does not take into account improvements in algorithms.

4. Curiously enough, that had actually been the state of affairs before, as Jan Tschichold (1971) observed: “In der frühzeit
des buchdrucks waren drucker und verleger eine und dieselbe person. Der drucker wählte selber die werke aus, die er 
verlegen wollte; oft war er selber entwerfer und hersteller der typen, mit denen er druckte; er beaufsichtigte selber den 
satz und setzte vielleicht selber mit. Dann druckte er den satz, und das einzige, was er nicht selber lieferte, war das papier. 
Nachher benötigte er vielleicht noch den rubrikator, der die initialen einzuschreiben hatte, und einen buchbinder, falls er das 
werk gebunden auf den markt brachte.” [Note that the capitalization is Tschichold’s.] 
In English, roughly: “In the early days of book printing, printer and publisher were one and the same person. The printer
himself selected the works he wanted to publish; often he was himself the designer and maker of the types with which he
printed; he supervised the typesetting himself and perhaps even helped with it. Then he printed the set type, and the only
thing he did not supply himself was the paper. Afterwards he might still need the rubricator, who had to write in the
initials, and a bookbinder, if he brought the work to market bound.”
more practically inclined. This book is about tools for computational finance, with an emphasis on
such techniques that have most benefited from the growth in computing power: simulation and op- 
timization techniques. More specifically, we will focus on computationally-intensive optimization 
methods: heuristics. 

The theme of the book is the practical, computational application of financial theory; we do 
not deal with the theory itself; in some cases, there may not even be much established theory. 
Quantitative finance today is very much driven by theory, which is often highly mathematical. But 
we feel that researchers should place much more emphasis on empirical and experimental results, similar
to how data analysis set itself apart from statistics: the former emphasized working with data; the 
latter preferred mathematical reasoning. Hence, we will not develop ideas theoretically but consider 
quantitative analysis as a primarily computational discipline, in which applications are put into 
software form and tested empirically. For every topic in this book, we discuss the implementation 
of methods. Algorithms will be given in pseudocode; MATLAB or R code will be provided for all
important examples.⁵

The readers we have in mind are students at Master or PhD level in programs on quantitative 
and computational finance, and researchers in finance. The book could be the main reference for 
courses in computational finance programs, or an additional reference for quantitative or mathemat- 
ical finance courses. But the book will also be valuable to practitioners in banks and other financial 
institutions. Many of the chapters have the flavor of case studies. From a pedagogical view, this 
allows the learning of the required steps for tackling specific problems; these steps can then be 
applied to other problems. From the practical side the selected problems will, we hope, be relevant 
enough to make the book a reference on how to implement a given technique. We have also in- 
cluded many smaller code recipes for particular tasks, such as producing specific random variates 
or ‘repairing’ variance-covariance matrices, that are often used in practical applications.

The ideal, prototypic kind of reader that we have in mind, “the Analyst,” does not work on
theory, at least not exclusively, but his job requires a close interaction between theoretical ideas and 
computation with data. 

The book is structured into three parts. The first part, “Fundamentals,” begins with an intro- 
duction to numerical analysis, so we discuss computer arithmetic, approximation errors, how to 
solve linear equations, how to approximate derivatives, and other topics. These discussions will 
serve as a reference to which later chapters often come back. For instance, the Greeks for option 
pricing models can often be computed by finite differences. Unfortunately, numerical methods are 
rarely discussed in standard finance books, even though they are the basis for empirically applying 
and testing models. Beyond the initial chapters, the book will not discuss numerical methods in an 
abstract way. For a general treatment see, for instance, Heath (2005) or the concise and highly
recommended summary given by Trefethen (2008b). We will only discuss numerical aspects that are
relevant to our applications. Further chapters will explain deterministic numerical methods to price 
instruments, namely trees and finite difference methods, and their application to standard products. 

The second part, “Simulation,” starts with chapters on how to generate random numbers and 
how to model dependence. Chapter 8 discusses how time series processes, simple assets, and port- 
folios can be simulated. There is also a short example on how to implement an agent-based model. 
By means of several case studies, applications will illustrate how these models can be used for 
generating more realistic price processes and how simulation can help to develop an intuitive un- 
derstanding of econometric models and solve practical problems. 

The third part, “Optimization,” deals with optimization problems in finance. We start with 
fundamental—in the sense of ubiquitous—problems and methods to solve them, such as root find- 
ing. Indeed, root-finding algorithms may be the most widely used numerical tools in finance: they 
are needed, among other things, for computing the yield to maturity (a.k.a. internal interest rate) of 
bonds, and for computing the implied volatility of options. Both operations are performed millions 
of times each day across the financial world. We also discuss how to solve Least Squares problems, 
yet another fundamental problem. 
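To make the root-finding task mentioned above concrete, here is a minimal R sketch that computes a bond's yield to maturity with `uniroot` from base R. The cash flows, the price, the search interval and the function name `ytm` are hypothetical choices for illustration; they are not taken from the later chapters.

```r
## Yield to maturity as a root-finding problem (illustrative sketch).
## All numbers and the name 'ytm' are hypothetical.
ytm <- function(price, cf, times, interval = c(-0.5, 1)) {
    ## present value of the cash flows at yield y, minus the observed price
    f <- function(y) sum(cf / (1 + y)^times) - price
    uniroot(f, interval = interval, tol = 1e-10)$root
}

cf    <- c(5, 5, 5, 105)   ## annual coupons of 5, redemption at 100
times <- 1:4               ## payment dates in years
y <- ytm(price = 100, cf = cf, times = times)
y                          ## a bond priced at par yields its coupon rate
```

The same pattern (write the pricing function, subtract the observed price, search for the zero) also applies to implied volatilities, with the option-pricing formula in place of the present-value sum.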


5. MATLAB and R are sufficiently similar so that code from one language can be translated into the other. In particular, both 
languages share the idea of vectorized computations. 
Unfortunately, many optimization models cannot be solved with standard methods that are
readily available in software packages. Thus, we are going to show how to use another class of
techniques, so-called heuristics; in fact, we will spend most of the book’s third part on these tech- 
niques. Chapter 11 will detail building blocks of numerical optimization algorithms. Though we 
include gradient-based techniques such as Newton’s method, emphasis will be put on more robust 
techniques such as direct search. Chapter 12, “Heuristic Methods in a Nutshell,” will then move 
from these building blocks to specific techniques, and give an overview of heuristic methods. The 
remaining chapters will deal with specific problems from portfolio optimization, the estimation of 
econometric models, and the calibration of option pricing models. 


1.2 Principles 


We start with some principles, or guidelines, for the applications in the book. 


1. We don’t know much in finance. No, we are not being modest here. “We” refers to finance and
the people working in finance in general (which includes us). A number of people may object
to this statement, so let us clarify: what we mean by “not knowing much” is that there is little
empirically founded and tested, objective knowledge in finance that can be confidently exploited
in practical applications. True, there is a vast body of theory. But when we look at the empirical
track record of much of this theory, let us say there remains room for improvement.

This has an important implication: when it comes to empirical applications, it is difficult in
finance to give general statements like “this model is better than that model” or “this method is
better than that method.” This is less of a problem in numerical analysis, where we can quite
objectively compare one method with another. But in finance, we really don’t know much. This
makes it difficult to decide what model or technique is the appropriate one for a specific purpose.

There is a second, related implication. If it is so difficult to tell what a good model is, then all
that is said must reflect opinion, as opposed to facts. Books, like this one, reflect opinion. In
fact, we use the “we” throughout, but this book is written by three authors, and the “we” does
not mean that we always agree. Everything in this book reflects the opinion of at least one of
us. We cannot claim that the techniques we discuss are the best and final answers, but all results
come from extensive analyses, and we have found those techniques useful, both in research and
in practice.

We really want to stress this point. Some textbooks in finance give the impression that, for
example, “if you want to do portfolio optimization, this is what you need to do.” No, please.
It should be “This is one possibility, and if you have a better idea, please try.” This is all the
more true since many models in finance were not motivated by empirical usefulness, but rather
mathematical elegance or mere tractability. We will argue in this book that modern computing
power (which we all have on our desktops) allows us to give up the need to adhere to rigid
mathematical structures (no, an objective function need not be quadratic).

2. Don’t be scared.⁶ Academic finance and economics have become ever more formal and mathematical
since the 1980s or so. In our experience, this can be intimidating, and not just to
students. One way to overcome this is to show that even complicated models may be only a 
few lines of computer code and, thus, manageable. (So even for seemingly standard material 
like binomial trees, we will detail implementation issues which, to our knowledge, are rarely 
discussed in detail in other books.) Implementing a model gives more intuition about a model
than only manipulating its equations. The reason: we get a feel for the magnitude of possible errors;
we see which difficulties are real, and which ones we need not bother with. And implementing a
model is in any case the necessary first step toward testing whether the model makes practical sense. The
purpose of computational finance as stated above (both academic and practical) means that the 
goal is neither to write beautiful equations nor to write beautiful code, but to get applications 
to run. Writing reliable software is often important here, but reliable means that the software 
fulfills its purpose. 


6. We should have written it on the cover and made the book slightly less expensive than its competitors; see Adams (1979) 
for an explanation of this book-selling strategy. 
3. Judge the quality of a solution with respect to the application. We are interested in methods for
computational finance. And in our definition of computational finance, the relevance of a model 
is determined by whether it fulfills its purpose. So it should not be judged by 

(i) how elegant the mathematics is; 

(ii) whether it nicely fits into a theoretical framework (e.g., if a model is built to help forecast 
interest rates, our concern should not be whether the model excludes arbitrage opportuni- 
ties); 

(iii) how elegantly or in what language the code is written (of course, if the purpose of a
program is to compute an answer fast, efficiency becomes a goal).

There is an important corollary here. When we judge quality, we must look at magnitudes. There
is no doubt that, mathematically, 0.0000001 > 0; but practically 0.0000001 may be the same as 
0. How a model and its solution help with the actual problem needs to be explicitly studied. 
Such analysis typically is difficult and not very precise, but carefully exploring, quantifying and 
discussing the effects of specific model choices is always better than dismissing such analysis 
as “out-of-scope.” 

4. Don’t translate math literally into computer code. Mathematically, for a vector $e$, computing
$\sum_i e_i^2$ may be equivalent to computing $e'e$, but this need not be the case on a computer. While
rounding errors are rarely relevant (see the previous rule and Chapter 2), there may be substantial 
differences in necessary computing time, which may in turn compromise the performance of a 
model. 

5. Go experiment. When you implement an idea, try things, experiment, work with your data, be 
wild. The tools described in this book will, we hope, help to better test and implement ideas. 
In the end, it is the applications that matter, not the tools. We do not need a perfect hammer;
the question is where to put the nail.
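Principle 4 can be tried out directly. The following R sketch computes the sum of squares of a vector in the two ways mentioned there, once as an elementwise square followed by a sum and once as the inner product e'e. The vector length and the number of repetitions are arbitrary assumptions, and which variant is faster depends on the machine, the R version and the linear-algebra library in use, so no timings are claimed here.

```r
## Two mathematically equivalent ways to compute a sum of squares.
set.seed(1)
e <- rnorm(1e6)                  ## hypothetical residual vector

ss_sum   <- sum(e^2)             ## square elementwise, then sum
ss_cross <- drop(crossprod(e))   ## inner product e'e; drop() turns the 1x1 matrix into a scalar

## equal up to rounding error, though not necessarily bit-identical
all.equal(ss_sum, ss_cross)

## measure, do not guess: results differ across machines
system.time(for (i in 1:50) sum(e^2))
system.time(for (i in 1:50) crossprod(e))
```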


1.3 On software 


We deal with the practical application of numerical methods, so we need to discuss software. The 
languages of choice will be MATLAB and R (R Core Team, 2018); sample code will be given in the
text when it is discussed. All code can be downloaded from our website http://www.nmof.net. For
the R code we have created a package NMOF.⁷

We needed to decide how to present code. The main purpose of giving sample programs
is to motivate readers to use the code, to experiment with it, to adapt it to their own problems.
This is not a book on software engineering; it is a book on the application of numerical methods in 
finance. 

Thus, we want simple, easy-to-understand, short code. After all, the Analyst’s job (see page 5) is 
not to design large-scale software architectures, but to analyze and compute with data. The Analyst 
will want to test ideas. But little would be gained if the answers we get from our software are not 
reliable. Thus, software matters a lot, and it has become much easier in recent years to do reliable 
computations: not just because of mature software packages such as MATLAB and R, but also 
because of many other software tools that are freely available: databases, command-line utilities, 
and all the great tools for doing reproducible research. For instance, several chapters of the book 
are written in Rnw format and processed with Sweave (Leisch, 2002), thus providing the complete 
code for figures and tables. 

Nevertheless, for the presentation in the book, a trade-off had to be made. In the printed book, 
we go for brief code. We will only sparingly discuss design issues such as reusability⁸ or error
handling. For R code included in the NMOF package, we will often show abbreviated versions. If 
you read the full source code, you will often find tests and checks. In any case, our goal is not to 
provide code that can be used without thinking of what the code does or if it is appropriate for a 
given problem. 


7. One should always check if chosen acronyms are used in different contexts; this may spare some quite embarrassing 
situations. Apparently, for scuba divers, NMOF means “no mask on forehead”, as opposed to MOF. Sharing the package’s 
name with this abbreviation seems acceptable. 

8. If you are interested in software design, we suggest you visit the QuantLib project at http://quantlib.org, or read Joshi 
(2008) or Hyer (2010). 
At many places throughout this book we will give suggestions on how to make code faster.⁹
We are aware that such suggestions are dangerous. Hardware changes, and software as well. For 
example, if you want to implement binomial trees, we suggest Chapter 5, but also Higham (2002). 
Higham compares various implementations of a binomial tree with MATLAB; his files can be down- 
loaded from http://personal.strath.ac.uk/d.j.higham/algfiles.html. A straightforward version of a 
tree uses two loops, one over prices and one over time. Higham finds that when the inner loop of 
the tree (going through the prices) is vectorized, he gets a speedup of 25 or so. When we run the 
programs today, the speedup is still impressive (say, 2-4, depending on the number of time steps), 
but clearly less than 25. Why is that? MATLAB since version 6.5 has included a Just-In-Time (JIT) 
compiler (MathWorks, 2002) which accelerates loops,¹⁰ often to the point where vectorization does
not give an advantage anymore. If you want to get an idea of the speed without compilation, you 
can disable the JIT compiler: 


feature accel off 
% run code here 
feature accel on 


The moral of this example is that (i) Higham knew what he wrote about, and (ii) memorizing rules 
cannot replace experimentation. 
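The same experiment can be repeated in R. The sketch below prices a European call on a Cox-Ross-Rubinstein tree twice: once with two explicit loops (over time and over prices) and once with the inner loop vectorized. This is neither Higham's code nor the implementation of Chapter 5; the function names `crr_loop` and `crr_vec`, the argument names and all parameter values are assumptions made for illustration.

```r
## European call on a Cox-Ross-Rubinstein tree, two ways (illustrative sketch)
crr_loop <- function(S0, K, r, tau, sigma, M) {
    dt <- tau / M
    u <- exp(sigma * sqrt(dt))
    d <- 1 / u
    p <- (exp(r * dt) - d) / (u - d)   ## risk-neutral probability of an up move
    disc <- exp(-r * dt)
    C <- numeric(M + 1)
    for (j in 0:M)                     ## payoffs at maturity (j up moves)
        C[j + 1] <- max(S0 * u^j * d^(M - j) - K, 0)
    for (i in M:1)                     ## outer loop: time
        for (j in 1:i)                 ## inner loop: prices
            C[j] <- disc * (p * C[j + 1] + (1 - p) * C[j])
    C[1]
}

crr_vec <- function(S0, K, r, tau, sigma, M) {
    dt <- tau / M
    u <- exp(sigma * sqrt(dt))
    d <- 1 / u
    p <- (exp(r * dt) - d) / (u - d)
    disc <- exp(-r * dt)
    C <- pmax(S0 * u^(0:M) * d^(M:0) - K, 0)
    for (i in M:1)                     ## only the time loop remains
        C <- disc * (p * C[2:(i + 1)] + (1 - p) * C[1:i])
    C
}

price_loop <- crr_loop(100, 100, 0.02, 1, 0.2, M = 500)
price_vec  <- crr_vec(100, 100, 0.02, 1, 0.2, M = 500)
all.equal(price_loop, price_vec)

## timings depend on R version, byte-compiler and machine
system.time(for (k in 1:20) crr_loop(100, 100, 0.02, 1, 0.2, M = 500))
system.time(for (k in 1:20) crr_vec(100, 100, 0.02, 1, 0.2, M = 500))
```

How large the speedup from vectorization turns out to be is exactly the kind of question that should be answered by running the code on the machine at hand.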

So we want efficient code, but does this not go against Rule 3? Not necessarily. True, efficiency 
(in the sense of fast programs) is a constraint, not a goal per se. If we need to compute a solution to 
some problem within 2 minutes, and we have a program that achieves this, there is no need to spend 
days (or more) to make the program run in 5 seconds. But there is one application that is crucial 
in finance for which a lack of speed becomes a hindrance: empirical testing. We will rarely devise 
a model, implement it, and start to use it; we will start by testing a model on historical data. The 
faster the programs work, the more we can test. Also, we may want to scale up the model later on, 
maybe with more data or more assets, hence efficient implementations become important. 

But then, if we are interested in efficient code, why did we not use lower-level languages like 
C in the first place? Several reasons: first, it is a matter of taste and path dependency.¹¹ Second,
for the applications that we looked into, MATLAB and R were always fast enough. Third, and most 
importantly, the Analyst’s goal is exploration, and then the time to implement a workable version 
of an algorithm matters. And here we think that MATLAB and R are superior. In any case, we will 
present many algorithms as pseudocode in which we do not use high-level language constructs (e.g., a loop
will always be written as a loop, not as a vectorized command), hence coding is possible for other 
languages as well. 

We will not implement all methods from scratch. A rule in programming is not to rewrite al- 
gorithms that are available in implementations of sufficient quality. Still, even available algorithms 
should be understood theoretically. Examples of this type of software include solvers for linear 
equations and other routines from linear algebra, collected in libraries such as LAPACK. To be con- 
crete, let us give an example why it makes sense to know how an algorithm works. Suppose we 
wanted to find the Least Squares solution for an overidentified linear system 


$$X\theta \approx y, \tag{1.1}$$

in which $X$ is an $m \times n$ matrix, $m > n$, and $y$ and $\theta$ are vectors of length $m$ and $n$, respectively.
We try to find $\theta$ that minimizes $\|X\theta - y\|_2^2$. Solving such equations is discussed in Chapter 3. The
Least Squares solution is given by solving the normal equations

$$X'X\theta = X'y,$$


9. When testing code, we mostly used R version 3.5.1 and MATLAB version R2017a. R was tested on a workstation with an 
Intel Core i7-5820K CPU @ 3.30GHz and 64GB RAM, running Lubuntu 18.04, with OpenBLAS (http://www.openblas.net/)
installed. MATLAB was tested on an x64-based PC with an Intel Core i7-6600U CPU @ 2.60GHz, 2808MHz, 2 Cores,
4 Logical Processors running Windows 10. 

10. R introduced a JIT compiler in version 2.13, released in 2011. 

11. It is safe to say that the three of us are or at least at some point were MATLAB fans. One of us bought his first license 
in 1985 and has used MATLAB ever since. 
but we need not explicitly form $X'X$. Rather, we compute the QR decomposition of $X$, where $Q$
is $m \times m$ and $R$ is $m \times n$. Suppose we need a fast computation (maybe we have to solve Eq. (1.1)
many times) and $m$ is far greater than $n$. Then working with the normal equations (i.e., really
computing $X'X$ and $X'y$) may be more practical (see Section 3.4). Importantly, both approaches
are equivalent mathematically, but not numerically. The following code examples for MATLAB and 
R show that there are sometimes substantial performance differences between methods. In R, we 
use the rbenchmark package (Kusnierczyk, 2012) to measure and compare computing times. 


Listing 1.1: C-Introduction/M/./Ch01/equations.m 


%% Time-stamp: <2018-02-05>
%% (also tested with Octave)

m = 10000; n = 10; trials = 100;
X = randn(m,n); y = randn(m,1);

%% QR
tic
for r = 1:trials
    sol1 = X\y;
end
toc

%% form (X'X) and (X'y)
tic
for r = 1:trials
    sol2 = (X'*X)\(X'*y);
end
toc

%% check whether results are the same
max(abs(sol1(:) - sol2(:)))


> m <- 100000  ## number of rows
> n <- 10      ## number of columns
> X <- array(rnorm(m * n), dim = c(m, n))
> y <- rnorm(m)

> ## QR decomposition
> fit_qr <- function(X, y)
      qr.solve(X, y)

> ## form (X'X) and (X'y)
> fit_normal_eq <- function(X, y)
      solve(crossprod(X), crossprod(X, y))

> ## Cholesky
> fit_cholesky <- function(X, y) {
      C <- chol(crossprod(X))
      rhs <- crossprod(X, y)
      backsolve(C, forwardsolve(t(C), rhs))
  }

> ## check whether the solutions are the same
> isTRUE(all.equal(
      fit_qr(X, y),
      c(fit_normal_eq(X, y))))  ## c() drops the dimension

[1] TRUE

> ## check whether the solutions are the same
> isTRUE(all.equal(
      fit_normal_eq(X, y),
      fit_cholesky(X, y)))

[1] TRUE

> ## compare speed
> library("rbenchmark")
> benchmark(
      fit_qr(X, y),
      fit_normal_eq(X, y),
      fit_cholesky(X, y),
      order = "relative")[, 1:4]

                 test replications elapsed relative
3  fit_cholesky(X, y)          100   0.256     1.00
2 fit_normal_eq(X, y)          100   0.285     1.11
1        fit_qr(X, y)          100   1.423     5.56


Another example for software that should not be written from scratch are random number gen- 
erators, in particular for uniform variates (which are the basis for other distributions). Knowledge 
of how variates are generated is helpful, but the rule is to trust MATLAB and R (by now, both
use the Mersenne Twister method). This does not mean that such implementations are necessarily
bug-free;¹² but they have been tested extensively, and are still being tested every day by thousands
of users. It is more likely that programming our own generators will lead to errors. 

Nevertheless, we will often need to write our own software. This may include pricing equations, 
trees, or lattices for pricing particular financial instruments, for instance in proprietary software; 
but in particular the simulation algorithms described in Part II and the optimization algorithms in 
Part III, which in turn rely on low-level libraries such as LAPACK. 


1.4 On approximations and accuracy 


Many of the models discussed in this book are optimization models. Essentially, any financial 
model can be rewritten as an optimization problem, since a model’s solution is always the entity 
that best fulfills some condition. (Of course, just because we can rewrite a model as an optimization 
problem does not mean that handling it like an optimization problem needs to be the most conve- 
nient way to solve it.) In setting up and solving such a model, we necessarily commit a number of 
approximation errors. “Errors” does not mean that something went wrong; these errors will occur 
even if all procedures work as intended. In fact, the essence of numerical analysis is sometimes 
described as approximating complicated problems with simpler ones that are easier to solve. But 
this does not mean that we lose something. The notion of an “approximation error” is meaningless 
if we do not evaluate its magnitude. This book is not a text on generic numerical methods, but 
on financial models, and how they are solved. So we should compare the errors coming from the 
numerical techniques to solve models with the overall quality of the model. 

Let us discuss these errors in more detail. (A classic reference on the analysis of such errors is 
von Neumann and Goldstine, 1947. The discussion in Morgenstern, 1963, Chapter 6, goes along 
the same lines.) The first approximation error comes when we move from the real problem to the 
model. For instance, we may move from actual prices in actual time to a mathematical description of 


12. Excel’s random number generator has a particularly bad reputation. In Excel 2003, for instance, the initially-shipped 
version generated negative uniform variates. 
the world, in which both prices and time are continuous (i.e., infinitely-small steps are possible). Or
we may characterize an investment portfolio as a random variable whose distribution is influenced 
by the chosen asset weights, and we try to select weights so that we obtain a favorable return 
distribution. But this leaves out many details of the actual investment process. 

Any such model, if it is to be empirically meaningful, needs a link to the real world. This 
link comes in the form of data, or parameters that have to be forecast, estimated, simulated or 
approximated in some way. Again, we have another source of error, for the available data may or 
may not well reflect the true, unobservable processes in which we are interested. 

When we solve such models on a computer, we approximate a solution; such approximations 
are the essence of numerical analysis. At the lowest level, errors come with the mere representation 
of numbers. For instance, MATLAB or R may agree that 0.1 + 0.1 equals 0.2. Let us check in R; 
MATLAB will give the same results. 


> 0.1 + 0.1 == 0.2

[1] TRUE

At the same time, both MATLAB and R consider the next expression as FALSE.

> 0.1 + 0.1 + 0.1 == 0.3

[1] FALSE


That is no bug: a computer can only represent a finite set of numbers exactly. Any other number 
has to be rounded to the closest representable number, hence we have what is called roundoff error, 
explained in more detail in Chapter 2. 
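In line with Principle 3, what matters is the magnitude of this roundoff error. A small R sketch, using only base R, shows how tiny the discrepancy is and how a comparison up to a tolerance (here via `all.equal`) treats the two numbers as equal:

```r
x <- 0.1 + 0.1 + 0.1

x == 0.3                      ## FALSE: exact comparison of two rounded numbers
isTRUE(all.equal(x, 0.3))     ## TRUE: comparison up to a small tolerance

abs(x - 0.3)                  ## size of the discrepancy
.Machine$double.eps           ## spacing of representable numbers around 1

sprintf("%.20f", c(x, 0.3))   ## the numbers that are actually stored
```

For comparisons of computed quantities, a tolerance that reflects the problem at hand is usually the safer choice than `==`.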

Next, many functions (e.g., the logarithm) cannot be computed exactly on a computer, but need 
to be approximated. Operations like differentiation or integration, in mathematical formulation, 
require a going-to-the-limit, that is, we let numbers tend to zero or infinity. But that is not possible
on a computer; any quantity must remain finite. Hence, we have so-called truncation error. For
optimization models, we may incur a specific variety of this error: some algorithms, in particular 
the methods that we describe in Part III, are stochastic, hence we do not (in finite time) obtain the 
model’s exact solution, but only an approximation (notwithstanding other numerical errors). 
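The interplay between truncation error and roundoff error can be seen in a few lines of R. The sketch below approximates the derivative of the exponential function at 1 by a forward difference for a range of step sizes h; the test function, the evaluation point and the grid of step sizes are arbitrary assumptions. For large h the truncation error dominates; for very small h the roundoff error takes over.

```r
## Forward-difference approximation of a derivative (illustrative sketch)
f  <- function(x) exp(x)     ## test function; its derivative at x0 is exp(x0)
x0 <- 1
h  <- 10^(-(1:15))           ## step sizes from 1e-1 down to 1e-15

fd  <- (f(x0 + h) - f(x0)) / h
err <- abs(fd - exp(x0))

data.frame(h = h, error = err)
h[which.min(err)]            ## the best step size is neither the largest nor the smallest
```

The error first falls as h shrinks and then rises again, so "letting h go to zero" is not what one should do on a computer.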

In sum, we can roughly divide our modeling into two steps: from reality to the model, and then 
from the model to its numerical solution. Unfortunately, large parts of the computational finance 
literature seem only concerned with assessing the quality of the second step, from model to im- 
plementation, and attempt to improve there. In the past, a certain division of labor may have been 
necessary: the economist created his model, and the computer engineer put it into numerical form. 
But today, there remains little distinction between the researcher who creates the model and the 
numerical analyst who implements it. Modern computing power allows us to solve incredibly com- 
plex models on our desktops: John von Neumann and Herman Goldstine, in the above-cited paper, 
describe the inversion of “large” matrices where large meant n > 10. In a footnote (fn. 12), they “an- 
ticipate that n ~ 100 will become manageable.” Today, MATLAB or R invert a 100 x 100 matrix on 
a normal desktop PC in a fraction of a second. But see Chapter 3 before you really invert a matrix. 

If the Analyst is now both modeler and computer engineer, then of course, the responsibility 
to check the reasonableness of the model and its solution lies—at all approximation steps—with 
the Analyst, and then only evaluating problems at the second step, from model to implementation,
falls short of what is required: any error in this step must be put into context, that is, we need to compare
it with the error introduced in the first step, when setting up the model. Admittedly, this is much
more difficult, but it is necessary. 

To give a concrete example: when we simulate a stochastic differential equation (SDE) on a 
computer, we will approximate it by a difference equation. Thus, we will generally introduce a 
discretization error (except for the rare cases in which we have a solution for the SDE). A typical 
application of such SDEs is option pricing with the Monte Carlo method (see Section 9.3). It is 
then sometimes stressed that in such cases, we can accept a less-precise result from the Monte 


Carlo procedure because we have a discretization error anyway. It is rarely stressed that we can
also accept a less-precise result because the SDE may be a poor approximation of the true process 
in the first place. 
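To get a feeling for the size of the discretization error, consider geometric Brownian motion, one of the rare cases in which the SDE has a known solution. The following R sketch simulates one path with the Euler scheme and compares it with the exact solution driven by the same random shocks. The choice of process and all parameter values are assumptions for illustration only.

```r
## Euler discretization versus exact solution for geometric Brownian motion
## dS = mu * S * dt + sigma * S * dW   (illustrative sketch, hypothetical parameters)
set.seed(42)
S0 <- 100; mu <- 0.05; sigma <- 0.2
tau <- 1; M <- 250; dt <- tau / M
z <- rnorm(M)                        ## the same shocks drive both paths

## Euler scheme: S[t+1] = S[t] * (1 + mu * dt + sigma * sqrt(dt) * z[t+1])
S_euler <- S0 * cumprod(1 + mu * dt + sigma * sqrt(dt) * z)

## exact solution evaluated at the same dates
S_exact <- S0 * exp(cumsum((mu - sigma^2 / 2) * dt + sigma * sqrt(dt) * z))

## relative discretization error at the horizon
abs(S_euler[M] / S_exact[M] - 1)
```

With daily steps the two paths are very close; whether a daily price follows such a process in the first place is a different and probably more important question.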

Suppose we accept a model as “true,” then the quality of the model’s solution will always be 
limited by the attainable quality of the model’s inputs. Appreciating these limits helps to decide 
how “exact” a solution we actually need. This decision is relevant for many problems in finan- 
cial engineering: We generally face a trade-off between the precision of a solution and the effort 
required (most notably, computing time). Surely, the numerical precision with which we solve a 
model matters; we need reliable methods. Yet, empirically, there must be an adequate precision 
threshold for any given problem. Any improvement beyond this level cannot translate into gains 
regarding the actual problem any more; only in costs (increased computing time or development 
costs). For many finance problems, we guess, this required precision is not high. 


Example 1.1 
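We often work with returns instead of prices. Given a financial instrument's price $S_t$ (where the subscript
denotes time), the discrete return $r^{\mathrm{discr}}_{t+1}$ between $t$ and $t+1$ is defined as

$$r^{\mathrm{discr}}_{t+1} = \frac{S_{t+1} - S_t}{S_t} = \frac{\Delta S_{t+1}}{S_t},$$

where $\Delta S_{t+1} = S_{t+1} - S_t$. Log returns are given by

$$r^{\log}_{t+1} = \log S_{t+1} - \log S_t.$$

When we compute portfolio returns, we can aggregate the returns of single assets on the portfolio
level by working with discrete returns; for log returns this is not true. But practically, the difference
between the returns is small. We have

$$\begin{aligned}
r^{\mathrm{discr}}_{t+1} - r^{\log}_{t+1}
&= \frac{\Delta S_{t+1}}{S_t} - \log\left(1 + \frac{\Delta S_{t+1}}{S_t}\right) \\
&= \frac{\Delta S_{t+1}}{S_t} - \left(\frac{\Delta S_{t+1}}{S_t} - \frac{1}{2}\left(\frac{\Delta S_{t+1}}{S_t}\right)^2 + \frac{1}{3}\left(\frac{\Delta S_{t+1}}{S_t}\right)^3 - \cdots\right) \\
&= \frac{1}{2}\left(\frac{\Delta S_{t+1}}{S_t}\right)^2 - \frac{1}{3}\left(\frac{\Delta S_{t+1}}{S_t}\right)^3 + \cdots.
\end{aligned}$$

We have used the fact that the Taylor expansion of $\log(1 + x)$ is

$$x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \cdots.$$
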
For many practical purposes there is no difference between log returns and discrete returns; any error 
arising from choosing the wrong method will be swamped by estimation error; see the next example. 
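The size of the difference is easy to check numerically. The R sketch below uses a short, made-up price series and made-up portfolio weights; it compares discrete and log returns with the leading term of the expansion and illustrates the aggregation properties of the two return definitions.

```r
## Discrete versus log returns (illustrative sketch, hypothetical numbers)
S <- c(100, 101, 99.5, 100.2, 102)      ## made-up prices

r_discr <- diff(S) / S[-length(S)]      ## (S[t+1] - S[t]) / S[t]
r_log   <- diff(log(S))                 ## log S[t+1] - log S[t]

r_discr - r_log                         ## the difference ...
r_discr^2 / 2                           ## ... and its leading term

## log returns add up over time
all.equal(sum(r_log), log(S[length(S)] / S[1]))

## discrete returns aggregate across assets
w        <- c(0.6, 0.4)                 ## made-up portfolio weights
r_assets <- c(0.02, -0.01)              ## discrete returns of two assets
r_pf     <- sum(w * r_assets)           ## portfolio's discrete return
log(1 + r_pf)                           ## the portfolio's log return ...
sum(w * log(1 + r_assets))              ## ... is not the weighted sum of log returns
```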


Example 1.2 


In numerical analysis, the sensitivity of a problem is defined informally as follows (see Chapter 2): if 
we perturb an input of a model, the change in the model’s output should be reasonably proportional 
to the input change. If the impact is far larger, the problem is called sensitive. Sensitivity often is not a 
numerical problem; it rather arises from the model or the data. In finance, models are sensitive. 

Fig. 1.2 shows, in its left panel, the S&P500 during 2009. The index level rose by 23%. By 23.45%, 
to be more precise, from 903.25 to 1115.10. But does it make sense to report this number to such a 
precision? We run a small experiment. We first transform the daily index levels into returns. Then, we 
randomly pick two observations—less than one percent of the daily returns—and delete (“jackknife”) 
them; then we compute the yearly return again. Repeating these steps 5000 times, we end up with a 
distribution of returns; it is pictured in the right panel of Fig. 1.2.¹³ The median return is about 23%, but
the 10th quantile is 20%, the 90th quantile is 27%, the minimum is only about 11%, and the maximum
is 34%! Apparently, tiny differences like adding or deleting a couple of days cause very meaningful
changes.

13. When we delete only two observations, we would not have needed jackknifing: listing all possibilities would lead to
about 30,000 yearly returns to be computed. This is more than 5000, but it still would take only a fraction of a second. But
using a sampling approach allows us to adjust the procedure more easily: for instance, deleting between 1 and 5 returns per
year would lead to about 8 billion possibilities.

FIGURE 1.2 Left: The S&P 500 in 2009. Right: Annual returns after jackknifing two observations. The vertical line gives
the realized return.

Of course, you may argue that 2009 was not a normal year: the American bank Lehman Brothers
had collapsed in the autumn of 2008, bringing the global financial system to the brink of disaster. But
nevertheless, crises happen in financial markets. So we repeated the analysis for the years 1987-2017.
The results are collected in Table 1.1. It illustrates that 2009 was a volatile year, no doubt; but it was not 
unique. 

This sensitivity has been documented in the literature (for instance in Acker and Duck, 2007, Dim- 
itrov and Govindaraj, 2007), but it is often overlooked or ignored. Hence the precision with which 
point estimates are sometimes reported must not be confused with accuracy. We may still be able to 
state qualitative (“This strategy performed better than that strategy.”) and quantitative (“How much bet- 
ter? About 2% per year.”) results, but we should not make single numbers overly precise. We need 
robustness checks: instead of single numbers, report ranges and distributions. Returns are the empirical 
building blocks of many models. If these simple calculations are already that sensitive, we should not 
expect more-complex computations based on them to be more accurate. 
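
The jackknifing experiment can be sketched in a few lines of Python. The index data are not reproduced here, so the sketch below uses simulated daily returns; the mean, volatility, and seeds are assumptions chosen only for illustration, and the numbers it prints are not those of Fig. 1.2.

```python
import random


def yearly_return(daily_returns):
    """Compound daily discrete returns into one yearly return."""
    growth = 1.0
    for r in daily_returns:
        growth *= 1.0 + r
    return growth - 1.0


def jackknife_yearly_returns(daily_returns, n_delete=2, n_trials=5000, seed=1):
    """Delete n_delete randomly chosen days, recompute the yearly return, repeat."""
    rng = random.Random(seed)
    n = len(daily_returns)
    results = []
    for _ in range(n_trials):
        dropped = set(rng.sample(range(n), n_delete))
        kept = [r for i, r in enumerate(daily_returns) if i not in dropped]
        results.append(yearly_return(kept))
    return results


# Assumption: 252 simulated trading days stand in for the actual index returns.
data_rng = random.Random(42)
daily = [data_rng.gauss(0.0008, 0.017) for _ in range(252)]

jk = sorted(jackknife_yearly_returns(daily))
print(f"actual yearly return: {yearly_return(daily):.3f}")
print(f"jackknifed minimum, median, maximum: "
      f"{jk[0]:.3f}, {jk[len(jk) // 2]:.3f}, {jk[-1]:.3f}")
```

Reporting the range of the jackknifed returns next to the point estimate is one simple form of the robustness check suggested above.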


Example 1.3 


The theoretical pricing of options, following the papers of Black, Scholes, and Merton in the 1970s, is 
motivated by an arbitrage argument according to which we can replicate an option by trading in the un- 
derlier and a risk-free bond. A replication strategy prescribes to hold a certain quantity of the underlier, 
the delta. The delta is changing with time and with moves in the underlier’s price, hence the options 
trader needs to rebalance his positions. Suppose you live in a Black-Scholes world. You just sold a 
one-month call (strike and spot price are 100, no dividends, risk-free rate is at 2%, volatility is constant 
at 30%), and you wish to hedge the position. There is one deviation from Black-Scholes, though: you 
cannot hedge continuously, but only at fixed points in time (see Kamal and Derman, 1999). 

We simulate 100,000 paths of the stock price, and delta-hedge along each path (see Higham, 2004). 
We compute two types of delta: one is the delta as precise as MATLAB can do; one is rounded to 
two digits (e.g., 0.23 or 0.67). The following table shows the volatility of the hedging error (i.e., the 
difference between the achieved payoff and the contractual payoff), in % of the initial option price. (It 
is often helpful to scale option prices, for example, price to underlier, or price to strike.) Fig. 1.3 shows 
replicated option payoffs. 


Frequency of rebalancing    With exact delta    With delta to two digits
Once per day                18.2%               18.2%
Five times per day           8.3%                8.4%


The volatility of the profit-and-loss is practically the same, so even in the model world, nothing is 
lost by not computing delta to a high precision. Yet in research papers and books on option pric- 
ing, we often find prices and Greeks to 4 or even 6 decimals. Here is the typical counterargument: 
“True, for one option we don’t need much precision. But what if we are talking about one million 
options? Then small differences matter.” We agree; but the question is not whether differences matter, 
but whether we can meaningfully compute them. If your accountant disagrees, suggest the following 
rule: whenever you sell an option, round up; when you buy, round down. Between buying one option 
or buying one million options, there is an important difference: you simply take more risk. And then
using more numerical precision does not help; instead, single numbers should be replaced by ranges of
outcomes.

TABLE 1.1 Price paths and yearly returns of the S&P 500. The actual index level
is used, so dividends are excluded. The table shows the results of the jackknifing
experiment: a density estimate for the yearly returns, centered about the
actual return. The vertical lines show −10, −2, 0, 2, and 10%. We also include
the difference between the 75th and 25th quantile of jackknifed yearly returns,
and the difference between the 99th and 1st quantile (in percentage points).

Year    Actual return in %
1987      2.0
1988     12.4
1989     27.3
1990     −6.6
1991     26.3
1992      4.5
1993      7.1
1994     −1.5
1995     34.1
1996     20.3
1997     31.0
1998     26.7
1999     19.5
2000    −10.1
2001    −13.0
2002    −23.4
2003     26.4
2004      9.0
2005      3.0
2006     13.6
2007      3.5
2008    −38.5
2009     23.5
2010     12.8
2011      0.0
2012     13.4
2013     29.6
2014     11.4
2015     −0.7
2016      9.5
2017     19.4

[The density plots and most of the quantile differences are not reproduced here. For 1987, Q75 − Q25 is 2.4 and
Q99 − Q1 is 15.9 percentage points.]

FIGURE 1.3 Payoff of replicating portfolios with delta to double precision (left), and delta to two digits (right).
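
The hedging experiment of Example 1.3 can be sketched as follows. This is a minimal Python sketch, not the authors' MATLAB code: it assumes 21 trading days in the month, daily rebalancing, a self-financing cash account, and only 2000 paths to keep the run short, so the numbers will differ slightly from the table in the example.

```python
import math
import random
import statistics


def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def d1(s, k, r, sigma, tau):
    return (math.log(s / k) + (r + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))


def bs_call(s, k, r, sigma, tau):
    """Black-Scholes price of a European call."""
    d_1 = d1(s, k, r, sigma, tau)
    d_2 = d_1 - sigma * math.sqrt(tau)
    return s * norm_cdf(d_1) - k * math.exp(-r * tau) * norm_cdf(d_2)


def hedging_error(path, k, r, sigma, t, digits=None):
    """Replicating portfolio minus option payoff, relative to the initial premium."""
    n = len(path) - 1
    dt = t / n
    premium = bs_call(path[0], k, r, sigma, t)

    def delta_at(i):
        delta = norm_cdf(d1(path[i], k, r, sigma, t - i * dt))
        return delta if digits is None else round(delta, digits)

    delta = delta_at(0)
    cash = premium - delta * path[0]
    for i in range(1, n):
        cash *= math.exp(r * dt)           # interest on the cash account
        new_delta = delta_at(i)
        cash -= (new_delta - delta) * path[i]  # rebalance the stock position
        delta = new_delta
    cash *= math.exp(r * dt)
    payoff = max(path[n] - k, 0.0)
    return (delta * path[n] + cash - payoff) / premium


def simulate_paths(n_paths, n_steps, s0, r, sigma, t, seed=1):
    """Geometric Brownian motion paths (drift r is an assumption of this sketch)."""
    rng = random.Random(seed)
    dt = t / n_steps
    drift = (r - 0.5 * sigma ** 2) * dt
    vol = sigma * math.sqrt(dt)
    paths = []
    for _ in range(n_paths):
        path = [s0]
        for _ in range(n_steps):
            path.append(path[-1] * math.exp(drift + vol * rng.gauss(0.0, 1.0)))
        paths.append(path)
    return paths


S0, K, R, SIGMA, T = 100.0, 100.0, 0.02, 0.30, 1.0 / 12.0
paths = simulate_paths(2000, 21, S0, R, SIGMA, T)
sd_exact = statistics.pstdev([hedging_error(p, K, R, SIGMA, T) for p in paths])
sd_round = statistics.pstdev([hedging_error(p, K, R, SIGMA, T, digits=2) for p in paths])
print(f"volatility of hedging error, exact delta:    {100 * sd_exact:.1f}%")
print(f"volatility of hedging error, two-digit delta: {100 * sd_round:.1f}%")
```

Both standard deviations are computed on the same simulated paths, so any difference between them comes from rounding the delta alone.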


1.5 Summary: the theme of the book


So what to make of it all? Here is our opinion: 


Quantitative models are very useful in finance. Not for every purpose, clearly; but in general we 
think that finance is far better off with quantitative methods. Properly applying such methods 
requires sound implementations. 

But: we think that we should not go overboard with it. The perfectionism that numerical ana- 
lysts employ to refine methods is not just unnecessary in finance, it is misplaced. In principle, 
there would seem little cost in being as precise as possible. But there is. First, highly precise 
or “exact” solutions give us a sense of certainty that is never justified by a model. Secondly, 
computing more-precise solutions will require more resources. This does not just mean more 
computing time, but often also more time to develop, implement, and test a particular method, 
and hence—economist!—it is simply a misallocation of resources. 

This is actually good news. It means that when we are pondering ideas and possible models, 
we should not be too much concerned with whether we can eventually solve a model with high 
precision. In sum, with all the things said before as a caveat: we can numerically solve any 
financial model, or—to put it the other way around—we doubt that there is any financial model 
that failed because it could not be handled numerically with sufficient precision. 


Numerical issues are not the main source of trouble in finance (they can be a nuisance, though). 
All we require are solutions to models that are good enough. And good enough seems quickly 
achieved in finance. To be clear, we are not suggesting to needlessly throw away numerical 
precision. If an algorithm solves a given model to double precision, then by all means use it. 
But very often, we face trade-offs, and numerical precision must be bought with more effort 
(such as computing time), or by making sacrifices, such as forcing restrictions and simplifica- 
tions upon a model. In such cases, we suggest that the trade-off be explicitly analyzed. In our 
experience, the results will often be that not much precision is needed. So as a general rule: 
do not despair when it comes to numerical issues, but compare them with the accuracy of the 
actual model. If the model’s quality is already limited, then do not waste your time thinking 
about fourth decimals. Think of quantitative finance as gardening; we need sturdy tools for it,
but no surgical instruments.

Chapter 2


Numerical analysis in a nutshell

2.1 Computer arithmetic


“All the trouble” in numerical computation comes from the fact that during computation, it can
happen that $1 + \epsilon = 1$ for $\epsilon \neq 0$. In other words, a computer, generally, cannot perform exact real
number arithmetic.

There are two sources of errors appearing systematically in numerical computation. One is 
related to the simplification of the mathematical model behind the problem to be solved, as for 
instance the replacement of a derivative by a finite difference or other discretizations of a continuous 
problem. These errors are termed truncation errors. The second source is due to the fact that it is 
impossible to represent real numbers exactly in a computer, which leads to rounding errors. 

The support for all information in a computer is a “word” constituted by a sequence of generally
32 bits,¹ a bit having a value of 0 or 1.


  2^31  2^30  ...  2^3  2^2  2^1  2^0
   0     0    ...   1    0    1    0


This sequence of bits is interpreted as the binary representation of an integer. Integers can thus 
be represented exactly in a computer. If all computations can be made with integers, we are dealing 
with integer arithmetic. Such computations are exact. 


Representation of real numbers 
In a computer, a real variable $x \neq 0$ is represented as²

$$x = \pm\, n \times b^{e}$$

where n is the mantissa (or fractional part), b is the base (always 2), and e is the exponent. For
instance, the real number 91.232 is expressed as $0.71275 \times 2^{7}$ (or with base 10, we have
$0.91232 \times 10^{2}$).


1. This is why computers are also called binary machines. More recent technology is moving toward 64 bits. 
2. The standards for the representation of real numbers have been set by the IEEE Standards Association, and we only 
present the basics.

In order to code this number, we partition the word in three parts: the first bit from the left
indicates the sign, one part contains the exponent e, and the other the mantissa n.


  ± | e | n


Whatever the size of the word, we have only a limited number of bits for the representation
of the integers n and e. As a consequence, it is impossible to represent all real numbers exactly.
To keep the illustration manageable, consider a word with 6 bits, in which t = 3 is the number of
bits reserved for the mantissa and s = 2 is the number of bits reserved for the exponent. One bit is
needed for the sign.


  sign    bits of e    e        bits of n    n
   ±      0 0          0        1 0 0        4
          0 1          1        1 0 1        5
          1 0          2        1 1 0        6
          1 1          3        1 1 1        7


Normalizing the mantissa, that is, imposing that the first bit from the left be 1, we have
$n \in \{4, 5, 6, 7\}$, and defining the offset of the exponent as $o = 2^{s-1} - 1$, we have

$$(0 - o) \le e \le (2^{s} - 1 - o),$$

that is $e \in \{-1, 0, 1, 2\}$. The real number x then varies between

$$(2^{t-1} \le n \le 2^{t} - 1) \times 2^{e}.$$

Adding 1 to the right of $(2^{t-1} \le n \le 2^{t} - 1) \times 2^{e}$ we get $n < 2^{t}$, and multiplying the
whole expression by $2^{-t}$, we obtain

$$(2^{-1} \le n\,2^{-t} < 1) \times 2^{e}.$$


The following table reproduces the set of positive real numbers $f = n \times 2^{e-t}$ for a word with 6
bits, t = 3 and s = 2.


                                      2^(e−t)
n                             e = −1    e = 0    e = 1    e = 2
                              1/16      1/8      1/4      1/2
1 0 0 = 2^2 = 4               1/4       1/2      1        2
1 0 1 = 2^2 + 2^0 = 5         5/16      5/8      5/4      5/2
1 1 0 = 2^2 + 2^1 = 6         3/8       3/4      3/2      3
1 1 1 = 2^2 + 2^1 + 2^0 = 7   7/16      7/8      7/4      7/2


Moreover, we observe that the representable real numbers are not equally spaced; the
grid becomes coarser when moving to larger numbers.


[Number line showing the representable numbers from 1/4 to 7/2.]


Modern software such as MATLAB® uses double precision (64-bit words) with t = 52 bits for
the mantissa and $e \in [-1023, 1024]$, the range of integers for the exponent.

The range of the set f of representable real numbers is then given by $m \le f \le M$ with
$m \approx 2.22 \times 10^{-308}$ and $M \approx 1.79 \times 10^{308}$. This is a quite impressive range (just for
comparison, the number of electrons in the observable universe is estimated to be of the order of $10^{80}$).
If the result of an expression is smaller than m, we get an underflow, which then results in a
zero value.³ If the result of a computation is larger than M, we are in a situation of overflow that
then results in an Inf that in most cases will destroy subsequent computations.

The notation float(·) represents the result of a floating point computation. A computer using the
arithmetic of the preceding example (t = 3 and s = 2) would produce with chopping⁴ the following
numbers: float(1.74) = 1.5 and float(0.74) = 0.625.


[Number line from 0 to 3 marking 0.74 and 1.74, each with its representable neighbor float(0.74) and float(1.74) to the left.]


The result for float(1.74 − 0.74) = 0.875 illustrates that even if the exact result is in the set of
representable numbers, its representation might be different.
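
The toy arithmetic with t = 3 and s = 2 is small enough to enumerate. The Python sketch below (the helper names `toy_floats` and `chop` are hypothetical) lists the representable positive numbers exactly with rational arithmetic and reproduces the chopping results quoted above.

```python
from fractions import Fraction


def toy_floats(t=3, s=2):
    """All positive numbers n * 2**(e - t) with a normalized t-bit mantissa."""
    offset = 2 ** (s - 1) - 1
    exponents = range(0 - offset, 2 ** s - 1 - offset + 1)
    mantissas = range(2 ** (t - 1), 2 ** t)
    return sorted(Fraction(n) * Fraction(2) ** (e - t)
                  for n in mantissas for e in exponents)


def chop(x, grid):
    """Chopping: the largest representable number not exceeding x."""
    return max(f for f in grid if f <= x)


grid = toy_floats()
print([str(f) for f in grid])

a = chop(Fraction("1.74"), grid)   # float(1.74)
b = chop(Fraction("0.74"), grid)   # float(0.74)
print(float(a), float(b), float(chop(a - b, grid)))
```

The gaps between neighbors double each time the exponent increases, which is the coarsening of the grid noted above.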


Machine precision 


The precision of floating point arithmetic is defined as the smallest positive number $\epsilon_{\text{mach}}$ such that

$$\text{float}(1 + \epsilon_{\text{mach}}) > 1.$$

The machine precision $\epsilon_{\text{mach}}$ can be computed with the following algorithm:

  e = 1
  while 1 + e > 1 do
      e = e/2
  end while
  ε_mach = 2e

With MATLAB we have $\epsilon_{\text{mach}} \approx 2.2 \times 10^{-16}$, that is, digits beyond the 16th position on the right
are meaningless.

Note that floating point computation is not associative. Indeed for $e = \epsilon_{\text{mach}}/2$, we obtain the
following results:

$$\text{float}(\text{float}(1 + e) + e) = 1,$$
$$\text{float}(1 + \text{float}(e + e)) > 1.$$
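
Both statements can be checked directly. The sketch below translates the algorithm into Python, whose floats are IEEE double precision on common platforms (an assumption of this sketch), and then evaluates the two orderings of the sum.

```python
import sys


def machine_epsilon():
    """Halve e until 1 + e is no longer distinguishable from 1."""
    e = 1.0
    while 1.0 + e > 1.0:
        e = e / 2.0
    return 2.0 * e


eps = machine_epsilon()
print(eps, eps == sys.float_info.epsilon)

e = eps / 2.0
left_to_right = (1.0 + e) + e    # each addition is rounded back to 1
grouped = 1.0 + (e + e)          # e + e equals eps, which is not lost
print(left_to_right == 1.0, grouped > 1.0)
```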


Example of limitations of floating point arithmetic 


Sometimes the approximation obtained with floating point arithmetic can be surprising. Consider
the expression

$$\sum_{k=1}^{\infty} \frac{1}{k} = \infty.$$

Computing this sum will, of course, produce a finite result. What might be surprising is that this
sum is certainly smaller than 35. The problem is not generated by an underflow of $1/k$ or an overflow
of the partial sum $\sum_{k=1}^{n} 1/k$ but by the fact that for n satisfying

$$\frac{1/n}{\sum_{k=1}^{n-1} 1/k} < \epsilon_{\text{mach}}$$

the sum remains constant. To show this, we consider $e < \epsilon_{\text{mach}}$ and we can write

$$\text{float}(1 + e) = 1,$$
$$\text{float}\left(\sum_{k=1}^{n-1} \frac{1}{k} + \frac{1}{n}\right) = \sum_{k=1}^{n-1} \frac{1}{k},$$
$$\text{float}\left(1 + \frac{1/n}{\sum_{k=1}^{n-1} 1/k}\right) = 1.$$

3. MATLAB denormalizes small numbers and handles expressions until $10^{-324}$.

4. As opposed to perfect rounding, chopping consists of cutting digits beyond a given position.


In the figure below, we show how the function $\sum_{k=1}^{n} 1/k$ grows for n going up to $10^{11}$.


[Figure: the partial sums plotted against n, with n on a logarithmic scale.]


From now on we consider only the sequence of values of n written as $n = 10^{i}$, $i = 1, 2, \ldots$ and
the corresponding sum $S_i = \sum_{k=1}^{10^{i}} 1/k$ and compute the order of n for which the computed sum
will stop growing. From the above figure we conclude that $S_i$ grows linearly with i; thus we can
extrapolate values of $S_j$, $j > 11$, i.e., $S_j = S_{11} + S_{11}/11 \times (j - 11)$. According to what has been said
above the sum will stop growing if

$$\frac{1/10^{j}}{S_j} < \epsilon_{\text{mach}}.$$

Considering rounded values for the sums we have $S_{11} \approx 26$ and $S_{14} \approx 33$ and we establish that

$$\frac{1/10^{14}}{S_{14}} \approx 3 \times 10^{-16}.$$

Given that $\epsilon_{\text{mach}} \approx 2 \times 10^{-16}$ we conclude that for n of the order $10^{14}$ the sum stops growing.

As 3 operations per step are needed, for a computer with a performance of 1 Gflops, the sum
will stop increasing after $(3 \times 10^{14})/10^{9} = 3 \times 10^{5}$ seconds, i.e., some 4 days of computing without
exceeding the value of 35.
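
Running the full sum takes days, but the stalling condition itself can be checked in an instant. The Python sketch below replaces the partial sum by the asymptotic approximation log n + γ (an assumption made here to avoid the long loop; it is not part of the argument in the text) and tests whether adding 1/n still changes the sum.

```python
import math

EULER_GAMMA = 0.5772156649015329


def harmonic_approx(n):
    """Asymptotic approximation of the partial sum: log(n) + gamma."""
    return math.log(n) + EULER_GAMMA


def stalls(n):
    """True if adding 1/n to the (approximate) partial sum leaves it unchanged."""
    s = harmonic_approx(n)
    return s + 1.0 / n == s


for i in range(11, 17):
    n = 10.0 ** i
    print(f"n = 1e{i}: sum is about {harmonic_approx(n):5.1f}, stalls: {stalls(n)}")
```

Under these assumptions the term 1/n is still absorbed into the sum at n = 10^14 but no longer at n = 10^15, which agrees with the order of magnitude derived above.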


2.2 Measuring errors 


In numerical computations, frequently, one has to measure the error e between an approximated
value $\hat{x}$ and an exact value x or, for instance, the distance between two successive values in an
iterative algorithm. The absolute error defined as

$$|\hat{x} - x|$$

is a very natural way to do this. However, this is not suitable for all situations. Consider the case
in which $\hat{x} = 3$ and $x = 2$; we have an absolute error of 1, which cannot be considered “small.” On
the contrary for $\hat{x} = 10^{9} + 1$ and $x = 10^{9}$, an absolute error of 1 can be considered “small” with
respect to x.

The relative error defined as

$$\frac{|\hat{x} - x|}{|x|}$$

for $x \neq 0$ avoids the problem illustrated earlier. The relative errors for the previous examples are 0.5
and $10^{-9}$.

If x = 0 (or very close to zero), the absolute error will do the job, but the relative error will not.
It is possible to combine absolute and relative errors with the following expression

$$\frac{|\hat{x} - x|}{|x| + 1}.$$

This definition then avoids the problems if x is near zero; it has the properties of the relative error
if $|x| \gg 1$ and the properties of the absolute error if $|x| \ll 1$.
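
The three error measures are one-liners. The following Python sketch (function names are illustrative) evaluates them for the two cases discussed above and for a value near zero.

```python
def absolute_error(x_hat, x):
    return abs(x_hat - x)


def relative_error(x_hat, x):
    """Only defined for x != 0."""
    return abs(x_hat - x) / abs(x)


def combined_error(x_hat, x):
    """Behaves like the relative error for large |x|, like the absolute error near zero."""
    return abs(x_hat - x) / (abs(x) + 1.0)


for x_hat, x in ((3.0, 2.0), (1e9 + 1.0, 1e9), (1e-3, 0.0)):
    rel = relative_error(x_hat, x) if x != 0 else float("nan")
    print(f"x_hat = {x_hat:g}, x = {x:g}: absolute {absolute_error(x_hat, x):g}, "
          f"relative {rel:g}, combined {combined_error(x_hat, x):g}")
```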


2.3 Approximating derivatives with finite differences 


Instead of working with the analytical derivative of a function, it is possible to compute a numerical 
approximation considering finite increments of the arguments of the function. Such approximations 
are easy to compute, their precision is generally sufficient, and, therefore, they are very convenient 
in numerical computations. 

We consider a continuous function $f : \mathbb{R} \to \mathbb{R}$, the derivative of which is

$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}. \qquad (2.1)$$

If, instead of letting h go to zero, we consider “small” values for h, we get an approximation for
$f'(x)$. An immediate question is then: How to choose h to get a good approximation if we work
with floating point arithmetic?

We consider f with derivatives of order n + 1. For finite h, the Taylor expansion writes

$$f(x+h) = f(x) + h f'(x) + \frac{h^2}{2} f''(x) + \frac{h^3}{6} f'''(x) + \cdots + \frac{h^n}{n!} f^{(n)}(x) + R_n(x+h)$$

with the remainder

$$R_n(x+h) = \frac{h^{n+1}}{(n+1)!} f^{(n+1)}(\xi), \qquad \xi \in [x, x+h],$$

where $\xi$ is not known.
Thus, Taylor series allow the approximation of a function with a degree of precision that equals
the remainder.


Approximating first-order derivatives 


Now, consider Taylor’s expansion up to order two,

$$f(x+h) = f(x) + h f'(x) + \frac{h^2}{2} f''(\xi) \quad \text{with } \xi \in [x, x+h],$$

from which we get

$$f'(x) = \frac{f(x+h) - f(x)}{h} - \frac{h}{2} f''(\xi). \qquad (2.2)$$

Expression (2.2) decomposes $f'(x)$ into two parts, the approximation of the derivative and the
truncation error. This approximation is called forward difference. The approximation by backward
difference is defined as

$$f'(x) = \frac{f(x) - f(x-h)}{h} + \frac{h}{2} f''(\xi). \qquad (2.3)$$

Consider $f(x+h)$ and $f(x-h)$:

$$f(x+h) = f(x) + h f'(x) + \frac{h^2}{2} f''(x) + \frac{h^3}{6} f'''(\xi_+) \quad \text{with } \xi_+ \in [x, x+h],$$
$$f(x-h) = f(x) - h f'(x) + \frac{h^2}{2} f''(x) - \frac{h^3}{6} f'''(\xi_-) \quad \text{with } \xi_- \in [x-h, x],$$

then the difference $f(x+h) - f(x-h)$ is

$$f(x+h) - f(x-h) = 2h f'(x) + \frac{h^3}{6}\left(f'''(\xi_+) + f'''(\xi_-)\right).$$

If $f'''$ is continuous, we can replace $\frac{f'''(\xi_+) + f'''(\xi_-)}{2}$ by the mean $f'''(\xi)$, $\xi \in [x-h, x+h]$, and we
get

$$\frac{h^3}{3}\,\frac{f'''(\xi_+) + f'''(\xi_-)}{2} = \frac{h^3}{3} f'''(\xi),$$

$$f'(x) = \frac{f(x+h) - f(x-h)}{2h} - \frac{h^2}{6} f'''(\xi),$$

which defines the approximation by central difference. We observe that this approximation is more
precise as the truncation error is of order $O(h^2)$.


Approximating second-order derivatives 


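Developing the sum $f(x+h) + f(x-h)$ in which

$$f(x+h) = f(x) + h f'(x) + \frac{h^2}{2} f''(x) + \frac{h^3}{6} f'''(x) + \frac{h^4}{24} f^{(4)}(\xi_+) \quad \text{with } \xi_+ \in [x, x+h],$$
$$f(x-h) = f(x) - h f'(x) + \frac{h^2}{2} f''(x) - \frac{h^3}{6} f'''(x) + \frac{h^4}{24} f^{(4)}(\xi_-) \quad \text{with } \xi_- \in [x-h, x],$$

we get

$$f''(x) = \frac{f(x+h) - 2f(x) + f(x-h)}{h^2} - \frac{h^2}{12} f^{(4)}(\xi) \quad \text{with } \xi \in [x-h, x+h].$$
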


There exist several different approximation schemes, also of higher order (see Dennis and Schn- 
abel, 1983). 


Partial derivatives 


The approximation by a central difference of the partial derivative with respect to x of a function
$f(x, y)$ is

$$f_x \approx \frac{f(x + h_x, y) - f(x - h_x, y)}{2h_x}.$$

Approximating the derivative of $f_x$ with respect to y with a central difference, we get the
approximation for $f_{xy}$,

$$f_{xy} \approx \frac{\dfrac{f(x+h_x, y+h_y) - f(x-h_x, y+h_y)}{2h_x} - \dfrac{f(x+h_x, y-h_y) - f(x-h_x, y-h_y)}{2h_x}}{2h_y}$$

$$\phantom{f_{xy}} = \frac{1}{4 h_x h_y}\Big(f(x+h_x, y+h_y) - f(x-h_x, y+h_y) - f(x+h_x, y-h_y) + f(x-h_x, y-h_y)\Big).$$
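
The difference formulas of this section translate directly into code. The following Python sketch (function names and test functions are illustrative choices) implements them and compares each approximation with a derivative known in closed form.

```python
import math


def forward_diff(f, x, h):
    return (f(x + h) - f(x)) / h


def backward_diff(f, x, h):
    return (f(x) - f(x - h)) / h


def central_diff(f, x, h):
    return (f(x + h) - f(x - h)) / (2.0 * h)


def second_diff(f, x, h):
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


def mixed_diff(f, x, y, hx, hy):
    """Central-difference approximation of the cross derivative f_xy."""
    return (f(x + hx, y + hy) - f(x - hx, y + hy)
            - f(x + hx, y - hy) + f(x - hx, y - hy)) / (4.0 * hx * hy)


def g(x, y):
    return math.sin(x) * math.exp(y)   # g_xy = cos(x) * exp(y)


print(forward_diff(math.sin, 1.0, 1e-8) - math.cos(1.0))
print(backward_diff(math.sin, 1.0, 1e-8) - math.cos(1.0))
print(central_diff(math.sin, 1.0, 1e-5) - math.cos(1.0))
print(second_diff(math.sin, 1.0, 1e-4) + math.sin(1.0))
print(mixed_diff(g, 1.0, 0.5, 1e-4, 1e-4) - math.cos(1.0) * math.exp(0.5))
```

The step sizes used here are deliberately different for each formula.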


How to choose h 


How to choose h if we use floating point arithmetic? To reduce the truncation error, we would be 
tempted to choose h to be as small as possible. However, we also must take into consideration the 
rounding error. 

If h is too small, we have float(x + h) = float(x), and even if float(x + h) ≠ float(x), we can
have float(f(x + h)) = float(f(x)) if the function varies very slowly.

Truncation error for forward difference
Denote

$$f'_h(x) = \frac{f(x+h) - f(x)}{h}$$

as the approximation of the derivative corresponding to a given value of h. If we accept that in the
truncation error $-\frac{h}{2} f''(\xi)$ we have, for a number M and all t in the domain of our application,

$$|f''(t)| \le M,$$

then we get the following bound for the truncation error

$$|f'_h(x) - f'(x)| \le \frac{h}{2} M,$$

indicating that we can reach any precision, given h is chosen to be sufficiently small. However, in
floating point arithmetic, it is not possible to evaluate $f'_h(x)$ exactly because of rounding errors.
Consider that the rounding error for evaluating f(x) has the following bound

$$|\text{float}(f(x)) - f(x)| \le \epsilon.$$

Then, the rounding error (for finite precision computations) for a forward difference approximation
is bounded by

$$|\text{float}(f'_h(x)) - f'_h(x)| \le \frac{2\epsilon}{h},$$

since

$$\text{float}(f'_h(x)) = \text{float}\left(\frac{f(x+h) - f(x)}{h}\right) = \frac{\text{float}(f(x+h)) - \text{float}(f(x))}{h}$$

and each of the two function evaluations contributes an error of at most $\epsilon$, that is, $(\epsilon + \epsilon)/h$ in total.

Finally, the upper bound for the sum of both sources of errors is

$$|\text{float}(f'_h(x)) - f'(x)| \le \frac{2\epsilon}{h} + \frac{M}{2} h.$$

A compromise between the reduction of the truncation error and that of the rounding error becomes
necessary. As a consequence, the precision we can achieve is limited.⁵ The optimal value for h
corresponds to the minimum of $g(h) = \frac{2\epsilon}{h} + \frac{h}{2} M$. From the first-order condition,

$$g'(h) = -\frac{2\epsilon}{h^2} + \frac{M}{2} = 0,$$

we compute that the bound reaches its minimum for

$$h = 2\sqrt{\frac{\epsilon}{M}}.$$

In practice, $\epsilon$ corresponds to the machine precision $\epsilon_{\text{mach}}$ and if we admit that M is of order one,
then setting M = 1 we get

$$h = 2\sqrt{\epsilon_{\text{mach}}}.$$

In MATLAB, we have $\epsilon_{\text{mach}} \approx 2 \times 10^{-16}$ and $h \approx 10^{-8}$.

In the case of central differences, we have $g(h) = \frac{\epsilon}{h} + \frac{h^2}{6} M$. The first-order condition is

$$g'(h) = -\frac{\epsilon}{h^2} + \frac{h}{3} M = 0,$$

from which we get $h = (3\epsilon/M)^{1/3}$. Replacing $\epsilon$ and M gives

$$h \approx 10^{-5}.$$

5. h is chosen as a power of 2 and therefore can be represented without rounding error.

FIGURE 2.1 Function $f(x) = \cos(x^x) - \sin(e^x)$ (left panel) and relative error of the numerical approximation of the
derivative at x = 1.5 for forward difference (thick line) and central difference (thin line).


Example 2.1 


To illustrate the influence of the choice of h on the precision of the approximation of the derivative, we
consider the function $f(x) = \cos(x^x) - \sin(e^x)$ and its derivative

$$f'(x) = -\sin(x^x)\, x^x \,(\log(x) + 1) - \cos(e^x)\, e^x.$$

Fig. 2.1 shows the relative error for the numerical approximation of the derivative evaluated at x = 1.5,
as a function of the step size h in the range of $10^{-16}$ to 1. For the central difference approximation, the
minimum error is achieved around $h = 10^{-5}$, and for the forward difference approximation, the optimal
value is $h = 10^{-8}$ according to our previous results.
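
The experiment behind Fig. 2.1 is straightforward to repeat. The Python sketch below is a reimplementation under the assumption of IEEE double precision, not the authors' MATLAB code; it tabulates the relative error of both approximations at x = 1.5 for a grid of step sizes.

```python
import math


def f(x):
    return math.cos(x ** x) - math.sin(math.exp(x))


def f_prime(x):
    """Analytical derivative of f."""
    return (-math.sin(x ** x) * x ** x * (math.log(x) + 1.0)
            - math.cos(math.exp(x)) * math.exp(x))


def rel_error(approx, exact):
    return abs(approx - exact) / abs(exact)


def forward_error(x, h):
    return rel_error((f(x + h) - f(x)) / h, f_prime(x))


def central_error(x, h):
    return rel_error((f(x + h) - f(x - h)) / (2.0 * h), f_prime(x))


X0 = 1.5
for k in range(0, 17, 2):
    h = 10.0 ** (-k)
    print(f"h = 1e-{k:02d}: forward {forward_error(X0, h):.2e}  "
          f"central {central_error(X0, h):.2e}")
```

The printed errors first fall and then rise again as h shrinks, which is the trade-off between truncation and rounding error described above.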


2.4 Numerical instability and ill-conditioning 


In Section 2.1, it has been shown that binary machines can represent only a subset of the real num- 
bers, introducing rounding errors that may seriously affect the precision of the numerical solution. 
If the “quality” of a solution is not acceptable, it is important to distinguish between the following 
two situations: 


• Rounding errors are considerably amplified by the algorithm. This situation is called numerical
  instability.

• Small perturbations of data generate large changes in the solution. This is termed an
  ill-conditioned (or sensitive) problem.


In the following we give illustrations of numerical instability and of an ill-conditioned problem. 


Example of a numerically unstable algorithm 


Consider the second-order equation $ax^2 + bx + c = 0$ and its analytical solutions

$$x_1 = \frac{-b - \sqrt{b^2 - 4ac}}{2a}, \qquad x_2 = \frac{-b + \sqrt{b^2 - 4ac}}{2a}.$$

The following algorithm simply translates the analytical solution:

  Δ = √(b² − 4ac)
  x1 = (−b − Δ)/(2a)
  x2 = (−b + Δ)/(2a)

For a = 1, c = 2, and floating point arithmetic with 5 digits of precision, the algorithm produces
the following solutions for different values of b:

b         Δ            float(Δ)    float(x2)    x2              |float(x2) − x2| / (|x2| + 1)
5.2123    4.378135     4.3781      −0.41708     −0.4170822      1.55 × 10^−6
121.23    121.197      121.20      −0.01500     −0.0164998      1.47 × 10^−3
1212.3    1212.2967    1212.3       0           −0.001649757    Catastrophic cancelation

Note that float(x2) = (−b + float(Δ))/(2a).


Let us now consider an alternative algorithm exploiting the relation $x_1 x_2 = c/a$ satisfied by the
solutions according to Viète’s theorem:

  Δ = √(b² − 4ac)
  if b < 0 then
      x1 = (−b + Δ)/(2a)
  else
      x1 = (−b − Δ)/(2a)
  end if
  x2 = c/(a x1)

This algorithm avoids catastrophic cancelation, that is, loss of significant digits when computing
a small number by subtracting two large numbers. Solving for b = 1212.3, we get an astonishingly
precise result considering that we use only 5 digits of precision.

b         float(Δ)    float(x1)    float(x2)     x2
1212.3    1212.3      −1212.3      −0.0016498    −0.001649757
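
Five-digit arithmetic can be emulated with Python's `decimal` module by setting the context precision to 5. The sketch below is one possible emulation (every operation is rounded to 5 significant digits, which may differ in detail from the rounding model behind the tables) and contrasts the two algorithms.

```python
from decimal import Decimal, localcontext


def roots_naive(a, b, c, digits=5):
    """Direct translation of the analytical solution."""
    with localcontext() as ctx:
        ctx.prec = digits
        a, b, c = Decimal(a), Decimal(b), Decimal(c)
        delta = (b * b - 4 * a * c).sqrt()
        return (-b - delta) / (2 * a), (-b + delta) / (2 * a)


def roots_stable(a, b, c, digits=5):
    """Compute the root without cancelation first, then use x1 * x2 = c / a."""
    with localcontext() as ctx:
        ctx.prec = digits
        a, b, c = Decimal(a), Decimal(b), Decimal(c)
        delta = (b * b - 4 * a * c).sqrt()
        if b < 0:
            x1 = (-b + delta) / (2 * a)
        else:
            x1 = (-b - delta) / (2 * a)
        return x1, c / (a * x1)


for b in ("5.2123", "121.23", "1212.3"):
    print(b, roots_naive("1", b, "2")[1], roots_stable("1", b, "2")[1])
```

Passing the coefficients as strings keeps them exact decimal numbers before the 5-digit arithmetic starts.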


Example of an ill-conditioned problem 


We consider the following linear system for which we easily verify that the exact solution is
$x = [1 \;\; {-1}]'$. Note that in practice, one almost never knows the solution. In our case, it has been
obtained by construction.

$$A = \begin{bmatrix} 0.780 & 0.563 \\ 0.913 & 0.659 \end{bmatrix}, \qquad b = \begin{bmatrix} 0.217 \\ 0.254 \end{bmatrix}. \qquad (2.4)$$

Solving this linear system with MATLAB, we obtain

$$x = \begin{bmatrix} \cdots \\ -0.99999999987542 \end{bmatrix}. \qquad (2.5)$$

Despite the seemingly high precision of the solution, this is an ill-conditioned problem. Consider
the matrix

$$E = \begin{bmatrix} 0.001 & 0.001 \\ -0.002 & -0.001 \end{bmatrix}$$

and the perturbed system $(A + E)\,x_E = b$. The new solution is then

$$x_E = \begin{bmatrix} -5.0000 \\ 7.3085 \end{bmatrix},$$

which differs from the original solution by far more than the size of the perturbation in matrix A
would suggest: the perturbation is of order $10^{-3}$, the change in the solution of order 10.
The solution of the linear system (2.4) corresponds to the intersection of the two lines
corresponding to the equations. This is shown in the left panel of Fig. 2.2, in which only one line
is visible as the lines are very close to being parallel. The right panel illustrates the zoomed-in
region around the coordinate (−3, 4.54), which clearly shows the existence of the two distinct
lines. Thus, a very small perturbation of the coefficients defining the lines leads to a substantial
shift of the intersection.

FIGURE 2.2 Left panel: graphical solution of linear system (2.4). Right panel: details around the coordinate (−3, 4.54) of
the straight lines defining the linear system.
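
The effect of the perturbation can be reproduced with a few lines of Python. The sketch below solves both 2×2 systems with Cramer's rule, a choice made here only to keep the example free of library dependencies; it is not how MATLAB solves linear systems.

```python
def solve_2x2(a, b):
    """Solve a 2x2 linear system with Cramer's rule."""
    (a11, a12), (a21, a22) = a
    det = a11 * a22 - a12 * a21
    x1 = (b[0] * a22 - a12 * b[1]) / det
    x2 = (a11 * b[1] - a21 * b[0]) / det
    return [x1, x2]


A = [[0.780, 0.563], [0.913, 0.659]]
E = [[0.001, 0.001], [-0.002, -0.001]]
b = [0.217, 0.254]

A_perturbed = [[A[i][j] + E[i][j] for j in range(2)] for i in range(2)]

x = solve_2x2(A, b)
x_e = solve_2x2(A_perturbed, b)
print("original system: ", x)
print("perturbed system:", x_e)
```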


2.5 Condition number of a matrix 


The sensitivity of the solution of a problem f(x) with respect to a perturbation of the data x is
called the condition of the problem. It can be considered as the absolute value of the elasticity

$$\text{cond}(f(x)) \approx \left|\frac{x\, f'(x)}{f(x)}\right|.$$

If this elasticity is large, the problem is referred to as ill-conditioned. Note that we can distinguish
three situations in which the condition number can become large:

• $f'(x)$ very large with x and f(x) normally sized
• x very large with f(x) and $f'(x)$ normally sized
• f(x) very small with x and $f'(x)$ normally sized

The condition number of a problem is generally difficult to estimate. However, for a linear system,
the condition number of the coefficient matrix A is defined as

$$\kappa(A) = \|A^{-1}\|\,\|A\|,$$

which can be computed efficiently in MATLAB with the function cond. For our example we have
cond(A) = $2.2 \times 10^{6}$.

As a rule of thumb, if cond(A) > $1/\sqrt{\text{eps}}$, we should worry about our numerical results. In
MATLAB, we have eps = $2.2 \times 10^{-16}$ which gives $1/\sqrt{\text{eps}} = 6.7 \times 10^{7}$, which means that in
solving a problem with a condition number of the order $10^{8}$, we lose, in the results, half of the
significant digits. Of course, we may become suspicious even if we encounter smaller condition
numbers depending on the kind of problem.
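
For a 2×2 matrix the 2-norm condition number can be computed by hand, because the product of the two singular values equals |det A| and the sum of their squares equals the squared Frobenius norm. The Python sketch below relies on these two identities (the function name `cond_2x2` is hypothetical, and the approach does not generalize to larger matrices) and also evaluates the rule-of-thumb threshold.

```python
import math
import sys


def cond_2x2(a):
    """2-norm condition number of a 2x2 matrix: sigma_max / sigma_min."""
    (a11, a12), (a21, a22) = a
    det = abs(a11 * a22 - a12 * a21)          # sigma_max * sigma_min
    if det == 0.0:
        return math.inf                        # singular matrix
    frob_sq = a11 ** 2 + a12 ** 2 + a21 ** 2 + a22 ** 2   # sigma_max^2 + sigma_min^2
    disc = max(frob_sq ** 2 - 4.0 * det ** 2, 0.0)
    sigma_max_sq = 0.5 * (frob_sq + math.sqrt(disc))
    return sigma_max_sq / det


A = [[0.780, 0.563], [0.913, 0.659]]
threshold = 1.0 / math.sqrt(sys.float_info.epsilon)

print(f"cond(A)     = {cond_2x2(A):.2e}")
print(f"1/sqrt(eps) = {threshold:.2e}")
```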
Example 2.2


In practical applications, we may well find situations with perfectly acceptable condition numbers, but, 
nevertheless, we should not trust the solutions too much. In other words, we may encounter settings for 
which we can numerically compute a solution without a problem, but should be careful in interpreting 
the results. 

An example: in a linear regression model, the Jacobian is the data matrix. The condition number 
of this matrix can be acceptable even though high correlation between the columns may prohibit any 
sensible inference regarding single coefficients. The following MATLAB script sets up a linear regression 
model with extremely correlated regressors. 


% -- set number of observations, number of regressors
nObs = 150; nR = 5;

% -- set up correlation matrix
C = ones(nR,nR) * 0.9999; C( 1:(nR+1):(nR*nR) ) = 1;

% -- create data
X = randn(nObs,nR); C = chol(C); X = X*C;
bTrue = randn(nR,1);
y = X*bTrue + randn(nObs,1)*0.2;
plotmatrix(X)

The regression X\y can be computed⁶ without numerical problems, though we had better not
interpret the coefficients.
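
Why the single coefficients should not be interpreted can be seen in a small simulation. The Python sketch below is a simplified two-regressor analogue of the MATLAB script, not a translation of it: the true coefficients (both set to 1), the 200 repetitions, and the seeds are assumptions made for illustration.

```python
import math
import random
import statistics


def ols_two_regressors(x1, x2, y):
    """Least squares without intercept, solved through the 2x2 normal equations."""
    s11 = sum(a * a for a in x1)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s22 = sum(b * b for b in x2)
    t1 = sum(a * c for a, c in zip(x1, y))
    t2 = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det


def simulate(rho=0.9999, n_obs=150, noise=0.2, seed=0):
    """One data set with two regressors of correlation rho; returns (b1, b2)."""
    rng = random.Random(seed)
    x1 = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    x2 = [rho * a + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0) for a in x1]
    y = [a + b + noise * rng.gauss(0.0, 1.0) for a, b in zip(x1, x2)]
    return ols_two_regressors(x1, x2, y)


estimates = [simulate(seed=s) for s in range(200)]
sd_b1 = statistics.pstdev([b1 for b1, _ in estimates])
sd_sum = statistics.pstdev([b1 + b2 for b1, b2 in estimates])
print(f"std. dev. of b1 across samples:      {sd_b1:.3f}")
print(f"std. dev. of b1 + b2 across samples: {sd_sum:.3f}")
```

Each regression is computed without numerical difficulty, yet the individual coefficients scatter widely from sample to sample while their sum is estimated tightly.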


Comments and examples 


The condition number represents a measure of the upper bound for the problems one may encounter
when solving a system of equations. As another example, consider the system of equations with

$$A = \begin{bmatrix} 10^{-10} & 0 \\ 0 & 10^{10} \end{bmatrix}$$

and any value for b. The condition number of matrix A is cond(A) = $10^{20}$ but we can compute the
solution $x_i = b_i / A_{ii}$ without any numerical problem.

Given a linear system Ax = b and its computed solution $x_c$, the quantity $r = A x_c - b$ is called
residual error. One might think that the residual error is an indicator for the precision of the
computation. This is only the case if the problem is well-conditioned. As an illustration, we take the
linear system (2.4) and consider two candidates $x_c$ for the solution and their associated residual
error:

$$x_c = \begin{bmatrix} 0.999 \\ -1.001 \end{bmatrix}, \; r = \begin{bmatrix} -0.0013 \\ -0.0016 \end{bmatrix}
\qquad \text{and} \qquad
x_c = \begin{bmatrix} 0.341 \\ -0.087 \end{bmatrix}, \; r = \begin{bmatrix} -10^{-6} \\ 0 \end{bmatrix}.$$

We observe that a much smaller residual error is produced by the candidate that is by far of lower
quality. Hence, we conclude that for ill-conditioned systems, the residual error is not suited to give
us information about the quality of the solution.
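
The two residuals are quickly verified. The Python sketch below (helper names are illustrative) computes r = A x_c − b for both candidates and their distance from the exact solution [1, −1].

```python
def residual(a, x, b):
    """Residual r = A x - b for a system given as nested lists."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) - b_i
            for row, b_i in zip(a, b)]


def max_abs(v):
    return max(abs(v_i) for v_i in v)


A = [[0.780, 0.563], [0.913, 0.659]]
b = [0.217, 0.254]
exact = [1.0, -1.0]

for candidate in ([0.999, -1.001], [0.341, -0.087]):
    r = residual(A, candidate, b)
    error = max_abs([c - e for c, e in zip(candidate, exact)])
    print(f"x_c = {candidate}: largest residual {max_abs(r):.1e}, largest error {error:.3f}")
```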

A matrix can be well-conditioned for the solution of the associated linear system and ill- 
conditioned for the identification of its eigenvalues, and vice versa. This is illustrated by the 
following examples. 

6. The workings of MATLAB’s backslash operator are explained on page 56.

As a first example, we take A = triu(ones(20)), that is, an upper triangular matrix with all its entries equal to one. A is ill-conditioned for the computation of its eigenvalues. With MATLAB’s function eig, we compute

$$\lambda_1, \ldots, \lambda_{20} = 1,$$

and setting $A_{20,1} = 0.001$, we get

$$\lambda_1 = 2.77, \; \ldots, \; \lambda_{20} = 0.57 - 0.08i.$$

However, A is well-conditioned for solving a linear system as cond(A) is 26.03.

As a second example, we take the matrix

$$A = \begin{bmatrix} 1 & 1 \\ 1 & 1+\delta \end{bmatrix},$$

which poses no problem for the computation of its eigenvalues. For $\delta = 0$, we get the following two eigenvalues

$$\lambda_1 = 0 \quad \text{and} \quad \lambda_2 = 2,$$

and setting $\delta = 0.001$, we get

$$\lambda_1 = 0.0005 \quad \text{and} \quad \lambda_2 = 2.0005.$$

In turn, A is ill-conditioned for the solution of a linear system. For $\delta = 0$, A is singular, that is, cond(A) is Inf. For $\delta = 0.001$, we obtain cond(A) is 4002.
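Both examples can be reproduced with a few lines of MATLAB. The following is a minimal sketch; the variable names (ev0, ev1, B, delta) are chosen here for illustration and do not come from the text.

```matlab
% First example: upper triangular matrix of ones
A   = triu(ones(20));
ev0 = eig(A);                 % eigenvalues of a triangular matrix: its diagonal
condA = cond(A);              % moderate condition number for linear systems
A(20,1) = 0.001;              % small perturbation in the lower-left corner
ev1 = eig(A);                 % eigenvalues move far away from 1

fprintf('max |lambda - 1| before: %8.2e, after: %8.2e\n', ...
        max(abs(ev0 - 1)), max(abs(ev1 - 1)));
fprintf('cond(A) = %.2f\n', condA);

% Second example: nearly singular 2-by-2 matrix
delta = 0.001;
B = [1 1; 1 1+delta];
disp(eig(B)')                 % close to the eigenvalues of the singular case
fprintf('cond(B) = %.0f\n', cond(B));
```

The first matrix should show a large shift of the eigenvalues together with a moderate condition number; the second should show stable eigenvalues together with a large condition number.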

To conclude, we recall that if we want to obtain sound numerical results with a computer, we 
need to make sure that we have: 


• a well-conditioned problem
• a numerically stable algorithm if executed with finite precision
• software that is a good implementation of the algorithm


Finally, a numerically stable algorithm will never be able to solve an ill-conditioned problem 
with more precision than the data contain. However, a numerically unstable algorithm may produce 
bad solutions even for well-conditioned problems. 


2.6 A primer on algorithmic and computational complexity 


The analysis of the computational complexity of an algorithm provides a measure of its efficiency and is used to compare and evaluate the performance of algorithms. Only a few elements relevant for characterizing the properties of the algorithms presented in this book will be discussed.⁷

The definition of the problem size is central as the computational effort of an algorithm should 
be a function of it. Generally, the size is expressed as roughly the number of elements the algorithm 
has to process (amount of data). In the case of the computation of the solution of a linear system 
Ax = b, the size of the problem is given by the order n of the matrix A. For a matrix multiplication, 
the size is defined by the dimension of the two matrices. 


Criteria for comparison 


A detailed comparison is generally difficult to perform and mostly not very useful. The most im- 
portant criterion for comparison is the execution time of the algorithm. In order to be practical, the 
measure has to be computed easily and has to be independent of the computational platform. There- 
fore, the execution time is measured in number of elementary operations. Thus, the complexity of 
an algorithm is defined as the number of elementary operations to solve a problem of size n. 


7. Practically, the specific implementation of an algorithm also matters. 
FIGURE 2.3 Operation count for algorithm $A_1$ (triangles) and $A_2$ (circles).


Consider an algorithm that sequentially seeks a particular item in a list of length n. Its computation time is then $C(n) \le k_1 n + k_2$, where $k_1 > 0$ and $k_2 > 0$ are two constants independent of n characterizing a particular implementation. Generally, the worst case $C(n) = k_1 n + k_2$ is considered.⁸

Other criteria are the so-called space complexity, that is, the amount of fast memory necessary 
to execute the algorithm, and the simplicity of the coding. The former tends to be less of a problem 
with modern computing technology but the latter should not be underestimated. 


Order of complexity and classification 


As mentioned earlier, we do not want to measure the complexity in detail. Generally, and in particular for matrix computation, we only count the number of elementary operations consisting of addition, subtraction, multiplication, and division. These operations are termed flop (floating point operation).⁹ Also, only the order of the counting function will be considered. As an illustration, consider two algorithms $A_1$ and $A_2$ in which the function counting the elementary operations is $C_{A_1}(n) = \tfrac{1}{2} n^2$ for the first algorithm and $C_{A_2}(n) = 5n$ for the latter.

In Fig. 2.3, we observe that for n > 10, Algorithm $A_2$ is faster than Algorithm $A_1$. As computing time becomes critical for growing values of n, it is only the order¹⁰ of the counting function which determines the complexity. Hence, the complexity of algorithm $A_1$ is $O(n^2)$ and algorithm $A_2$ has a complexity of $O(n)$. The complexity of an algorithm with $\tfrac{1}{3} n^3 + n^2 + \tfrac{2}{3} n$ elementary operations is of order $O(n^3)$.
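The crossover point and the dominance of the leading term can be verified directly. This is a small sketch; the names C1, C2, nStar, and ratio are illustrative.

```matlab
% Operation counts of the two algorithms for n = 1, ..., 20
n  = 1:20;
C1 = 0.5*n.^2;                 % counting function of algorithm A1
C2 = 5*n;                      % counting function of algorithm A2
nStar = find(C1 > C2, 1);      % first n for which A2 needs fewer operations
fprintf('A2 needs fewer operations from n = %d onward\n', nStar);

% Lower-order terms become negligible relative to n^3 as n grows
nBig  = 10.^(1:4);
ratio = (nBig.^3/3 + nBig.^2 + 2*nBig/3) ./ nBig.^3;
disp(ratio)                    % approaches 1/3
```

At n = 10 both counts are equal, so the strict improvement starts at n = 11, consistent with the statement that $A_2$ is faster for n > 10.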

Algorithms are classified into two main groups according to the order of the function counting 
elementary operations: 


• Polynomial time algorithms, that is, the operation count is a polynomial function of the problem size.
• Non-polynomial time algorithms.


Only algorithms from the polynomial time class can be considered as efficient. 

The performance of computers is measured in flops per second. Personal computers evolved from a few Kflops ($10^3$ flops) in the early 1980s to several Gflops ($10^9$ flops) at the end of the first decade of the 2000s. The size of available fast memory increased by the same factor.


8. Another way is to consider the average number of operations, which generally has to be evaluated by simulation. Some 
algorithms that behave very badly in the worst case perform efficiently on average. An example is the simplex algorithm for 
the solution of linear programs. 

9. An example of the count of elementary operations for an algorithm is given on page 37. 

10. The order of a function is formalized with $O(\cdot)$ notation, which is defined as follows: a function g(n) is O(f(n)) if there exist constants $c_0$ and $n_0$ such that g(n) is smaller than $c_0 f(n)$ for all $n > n_0$.
Note that faster computers will not remove non-polynomial algorithms from the class of non-efficient algorithms. To illustrate this, consider, for instance, an algorithm with complexity $2^n$ and assume N to be the size of the largest problem instance one can solve in an acceptable amount of time. Increasing the computation speed by a factor of 1024 will increase the size of the problem computed in the same time only to N + 10, since $1024 \cdot 2^N = 2^{N+10}$.


Appendix 2.A Operation count for basic linear algebra operations 


Each of the elementary operations (addition, subtraction, multiplication, and division) counts for one flop. Given the vectors $x, y \in \mathbb{R}^n$, $z \in \mathbb{R}^m$ and the matrices $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{r \times n}$, we have:

Operation    flops
x + y        n
x'x          2n      (n > 1)
xz'          nm
AB           2mrn    (r > 1)
Systems of linear equations arise very frequently in numerical problems, and therefore, the choice of an efficient solution method is of great importance. Formally, the problem consists in finding the solution of Ax = b, where A is the coefficient matrix, b the so-called right-hand-side vector, and x the unknown solution vector. In other words, we have to find the solution for the following system of equations

$$
\begin{array}{c}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n = b_1 \\
a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n = b_2 \\
\vdots \\
a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nn}x_n = b_n
\end{array}
\qquad \equiv \qquad
\underbrace{\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix}}_{A}
\underbrace{\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}}_{x}
=
\underbrace{\begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}}_{b}.
$$

The expression on the right side is the equivalent formalization in matrix form.

A few preliminary remarks: The solution exists and is unique if and only if the determinant of A satisfies $|A| \neq 0$. The algebraic notation for the solution $x = A^{-1}b$ might suggest that we need to compute the inverse $A^{-1}$. However, an efficient numerical method computes neither the inverse $A^{-1}$ to solve the system nor the determinant $|A|$ to check whether a solution exists.
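As a rough illustration of this remark, the sketch below solves one system with the backslash operator and, for comparison only, with an explicit inverse. The test matrix is an assumption made here (random entries plus a strong diagonal, so that it is well-conditioned); it is not taken from the text.

```matlab
n = 500;
A = randn(n) + n*eye(n);     % assumed test matrix, well-conditioned by construction
xTrue = ones(n,1);
b = A*xTrue;

x1 = A\b;                    % factorization-based solve, no inverse formed
x2 = inv(A)*b;               % explicit inverse, shown only for comparison

fprintf('error with A\\b      : %8.2e\n', norm(x1 - xTrue));
fprintf('error with inv(A)*b : %8.2e\n', norm(x2 - xTrue));
```

Both variants return an accurate solution for such a benign matrix; the explicit inverse simply requires more work than is needed to solve a single system.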

A word of caution: The intention of the presentation of the algorithms that follow is not to instruct the user to write their own code for solving linear systems. It would be very hard to beat the code implemented in most popular software packages. The aim of this presentation is to provide understanding and to help make appropriate choices among the available techniques. For all details, see Golub and Van Loan (1989).


Choice of method 

The choice of the appropriate method to solve a linear system depends on the structure and the 
particular quantification of the matrix A. With respect to the properties of the structure of the 
matrix A, we distinguish the case where the matrix is 
• dense, that is, almost all of the elements are nonzero
• sparse, that is, only a small fraction of the $n^2$ elements of the matrix are nonzero
• banded, that is, the nonzero elements of the matrix are concentrated near the diagonal (the bandwidth b is defined as $a_{ij} = 0$ if $|i - j| > b$; a diagonal matrix has b = 0 and a tridiagonal matrix has b = 1)
• triangular or has a block structure

Particular matrices with respect to their content are, among others, symmetric matrices (A = A') and more importantly positive-definite matrices ($x'Ax > 0, \ \forall x \neq 0$). As a rule, if a matrix has structure, this can (in principle) be exploited.

There exist two broad categories of numerical methods for solving linear systems:

• Direct methods which resort to factorization
• Iterative methods (stationary and nonstationary).


3.1 Direct methods 


Direct methods transform the original problem into an easy-to-solve equivalent problem. For in- 
stance, a system Ax = b is transformed into Ux = c with matrix U triangular or diagonal, which 
makes it easy to compute x. Another approach consists in the transformation Qx = c with Q or- 
thogonal which again allows recovery of x immediately from x = Q’c. 


3.1.1 Triangular systems 


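As triangular systems play an important role in the solution of linear systems, we briefly present the methods for solving them. Consider the lower triangular system

$$\begin{bmatrix} \ell_{11} & 0 \\ \ell_{21} & \ell_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}.$$

If the elements satisfy $\ell_{11}, \ell_{22} \neq 0$, the unknowns can be computed sequentially as

$$x_1 = b_1 / \ell_{11}$$
$$x_2 = (b_2 - \ell_{21} x_1) / \ell_{22}.$$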


Forward substitution 


For a system Lx = b, with L a lower triangular matrix, the solution for the ith equation can be written as

$$x_i = \Bigl( b_i - \sum_{j=1}^{i-1} \ell_{ij} x_j \Bigr) / \ell_{ii}.$$

This procedure is called forward substitution. Note that the element $b_i$ is only used for the computation of $x_i$ (see also the above figure), and thus, we can overwrite $b_i$ with $x_i$ in vector b. Algorithm 1 solves the lower triangular system Lx = b of order n by forward substitution and overwrites b with the solution x. The algorithm necessitates $n^2 + 3n - 4$ flops.


Algorithm 1 Forward substitution.
1: $b_1 = b_1 / L_{1,1}$
2: for i = 2 : n do
3:     $b_i = (b_i - L_{i,1:i-1}\, b_{1:i-1}) / L_{ii}$
4: end for
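Algorithm 1 translates almost line by line into MATLAB. The following is a minimal sketch; the test matrix and the names L, xTrue are assumptions made for illustration, and in practice the backslash operator would be used instead.

```matlab
n = 6;
L = tril(rand(n)) + n*eye(n);    % assumed test matrix: lower triangular, nonzero diagonal
xTrue = (1:n)';
b = L*xTrue;

% Forward substitution, overwriting b with the solution x
b(1) = b(1)/L(1,1);
for i = 2:n
    b(i) = (b(i) - L(i,1:i-1)*b(1:i-1)) / L(i,i);
end

disp(norm(b - xTrue))            % distance to the known solution
```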
Back-substitution


For an upper triangular system Ux = b we proceed in an analogous way as done before, that is,

$$x_i = \Bigl( b_i - \sum_{j=i+1}^{n} u_{ij} x_j \Bigr) / u_{ii}.$$

This procedure is called back-substitution. Again the element $b_i$ can be overwritten by $x_i$. Algorithm 2 solves the upper triangular system Ux = b by back-substitution and overwrites b with the solution x. The operation count is identical to the one of Algorithm 1.


Algorithm 2 Back-substitution.
1: $b_n = b_n / U_{n,n}$
2: for i = n-1 : -1 : 1 do
3:     $b_i = (b_i - U_{i,i+1:n}\, b_{i+1:n}) / U_{ii}$
4: end for
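A corresponding MATLAB sketch of Algorithm 2 is given below. As before, the test matrix and variable names are illustrative assumptions.

```matlab
n = 6;
U = triu(rand(n)) + n*eye(n);    % assumed test matrix: upper triangular, nonzero diagonal
xTrue = (1:n)';
b = U*xTrue;

% Back-substitution, overwriting b with the solution x
b(n) = b(n)/U(n,n);
for i = n-1:-1:1
    b(i) = (b(i) - U(i,i+1:n)*b(i+1:n)) / U(i,i);
end

disp(norm(b - xTrue))            % distance to the known solution
```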


3.1.2 LU factorization 


LU factorization is the method of choice if matrix A is dense and has no particular structure. Matrix 
A is factorized into a product of two matrices L and U, which are lower and upper triangular, 
respectively. In Ax = b, we replace A by its factorization LU and solve the transformed system 
LUx = b in two steps, each involving the solution of a triangular system. This is illustrated in 
Fig. 3.1 where y is an intermediate vector. 


Ax = b  becomes  LUx = b, with y = Ux:
    Solve Ly = b (forward substitution)
    Solve Ux = y (back-substitution)


FIGURE 3.1 LU factorization.


LU factorization necessitates $\tfrac{2}{3} n^3 - \tfrac{1}{2} n^2 - \tfrac{1}{6} n + 1$ elementary operations, and therefore, its complexity is $O(n^3)$.


LU factorization with MATLAB 


MATLAB® uses the syntax x = A\b to solve Ax = b no matter what type A is.¹ Let us illustrate the three steps of the procedure presented above.

1. The workings of the backslash operator \ in MATLAB are explained on page 56.

The MATLAB command [L,U] = lu(A) computes L and U, but L is a permuted triangular matrix which cannot be used as such in the forward substitution. Therefore, we execute the command


[L,U,P] = lu(A)


that produces a triangular matrix L, and a permutation matrix P such that LU = PA, which means that L and U correspond to the LU factorization of a permuted system of equations. The lines of code that follow illustrate the procedure. Forward and back-substitutions are executed with the backslash operator, and the vector y is not explicitly computed.


n = 5; x = ones(n,1); A = unidrnd(40,n,n); b = A*x;
[L,U] = lu(A);
x1 = L\(U\b);
[L,U,P] = lu(A);
x2 = U\(L\P*b);
x3 = A \ b;


The output produced is given below. We observe that x1, which is computed without taking into 
consideration the permutation, produces a wrong result. 


disp([x1 x2 x3])


   17.2993    1.0000    1.0000
    4.0528    1.0000    1.0000
  -19.1852    1.0000    1.0000
   -5.4929    1.0000    1.0000
 -114.7563    1.0000    1.0000


If the system has to be solved several times for different right-hand-side vectors, the factorization has to be computed only once.²


n = 300; R = 1000; A = unidrnd(40,n,n); B = unidrnd(50,n,R);

t0 = tic;
X = zeros(n,R);
[L,U,P] = lu(A);
for k = 1:R
    X(:,k) = U\(L\(P*B(:,k)));
end
fprintf(' Elapsed time %5.3f sec\n',toc(t0));

 Elapsed time 0.762 sec


Note that this can be coded in a more efficient way by applying the permutation to the whole 
matrix in one step and by overwriting B with the solution. The matrices A and B are taken from the 
previous code. 


t0 = tic;
[L,U,P] = lu(A);
B = P*B;
for k = 1:R
    B(:,k) = U\(L\B(:,k));
end
fprintf(' Elapsed time %5.3f sec\n',toc(t0));


Elapsed time 0.166 sec 


According to the elapsed time, we have an improvement of about 5 times. However, the most 
efficient way to code the problem is to apply the backslash operator to the right-hand matrix. This 
is about 40 times faster than our initial code. 


t0 = tic;
X = A \ B;
fprintf(' Elapsed time %5.3f sec\n',toc(t0));


Elapsed time 0.020 sec 


2. The MATLAB command tic starts a stopwatch timer and toc reads the timer. 
3.1.3 Cholesky factorization


If the matrix A is symmetric and positive-definite, we use the Cholesky factorization. A matrix $A \in \mathbb{R}^{n \times n}$ is positive-definite if for any vector $x \in \mathbb{R}^n$, $x \neq 0$, we have $x'Ax > 0$. This is, among others, the case for A = X'X with matrix X having full column rank, a very frequent situation in econometric and statistical analysis.

This factorization necessitates fewer computations as it produces only one triangular matrix R. Its operation count is $\tfrac{1}{3} n^3 + n^2 + \tfrac{8}{3} n - 2$ flops. To solve a linear system Ax = b, we again replace A by its factorization R'R and proceed with the forward and back-substitutions as done with the LU factorization.

The Cholesky decomposition is also appropriate to test numerically whether a matrix A is positive-definite. In MATLAB this is done by calling the function chol which computes the Cholesky factorization with two output arguments [R,p] = chol(A). If the function returns p = 0, A is positive-definite; if it returns p ≠ 0, A is not, and only the submatrix A(1:p-1,1:p-1) is positive-definite. The following MATLAB code illustrates the use of the Cholesky factorization for the solution of a linear system with a symmetric and positive-definite coefficient matrix A.


n = 5; m = 20;
X = unidrnd(50,n,n); A = X'*X; x = ones(n,1); b = A*x;
[R,p] = chol(A);
if p > 0, error('A not positive-definite'); end
x1 = R\(R'\b);


The Cholesky factorization can also be used to generate very cheaply multivariate random variables. A random vector x with given desired variance-covariance matrix $\Sigma$ is generated with the product

$$x = R'u,$$

where $R'R = \Sigma$ is the Cholesky factorization and $u \sim N(0, I)$ is an i.i.d. normal vector. The variance-covariance matrix of x then satisfies (see Section 7.1.1)

$$E(xx') = E(R'uu'R) = R' \underbrace{E(uu')}_{I} R = R'R = \Sigma.$$

The MATLAB code given below generates 1000 observations of three normally distributed vectors with variance-covariance Q and then re-estimates their variance-covariance matrix.³


Q = [400  40 -20
      40  80 -10
     -20 -10 300];

[R,p] = chol(Q);
if p ~= 0, error('Q not positive-definite'); end
n = 1000; U = randn(3,n);
Z = R'*U;

disp(fix(cov(Z')))

   407    42   -33
    42    82   -15
   -33   -15   284


3. Note that the smaller the value n, the larger the deviations between the variance-covariance matrix of the data-generating process and the empirical variance-covariance matrix will become.
The Cholesky algorithm


In order to explain how the Cholesky matrix can be computed, we consider the jth step in the factorization A = GG', where the first j - 1 columns of G are already known. In the following scheme the light gray area corresponds to the elements already computed and the dark one to the elements computed at step j. (The unknown element $G_{jj}$ has been left uncolored.)

Column j of the product then corresponds to a combination of the j columns of matrix $G_{1:n,1:j}$, that is,

$$A_{1:n,j} = \sum_{k=1}^{j} G_{jk}\, G_{1:n,k}$$

from where we can isolate

$$G_{jj}\, G_{1:n,j} = A_{1:n,j} - \sum_{k=1}^{j-1} G_{jk}\, G_{1:n,k} \equiv v$$

and as $G_{1:n,1:j-1}$ is known, we can compute v. We then consider the expression $G_{jj}\, G_{1:n,j} = v$; the components of vector $G_{1:j-1,j}$ are zero as G is a lower triangular matrix. Element j of this vector is $G_{jj}\, G_{jj} = v_j$ from where we get that $G_{jj} = \sqrt{v_j}$. The solution for vector $G_{j:n,j}$ is then

$$G_{j:n,j} = v_{j:n} / \sqrt{v_j}$$

from which we can derive the following algorithm:

for j = 1 : n do
    $v_{j:n} = A_{j:n,j}$
    for k = 1 : j-1 do
        $v_{j:n} = v_{j:n} - G_{jk}\, G_{j:n,k}$
    end for
    $G_{j:n,j} = v_{j:n} / \sqrt{v_j}$
end for


It is possible to formulate the algorithm so as to overwrite the lower triangle of matrix A by matrix G. Given a positive-definite matrix $A \in \mathbb{R}^{n \times n}$, Algorithm 3 computes a lower triangular matrix $G \in \mathbb{R}^{n \times n}$ such that A = GG'. For $i \ge j$, this algorithm overwrites $A_{ij}$ by $G_{ij}$.


Algorithm 3 Cholesky factorization.
1: $A_{1:n,1} = A_{1:n,1} / \sqrt{A_{11}}$
2: for j = 2 : n do
3:     $A_{j:n,j} = A_{j:n,j} - A_{j:n,1:j-1}\, A'_{j,1:j-1}$
4:     $A_{j:n,j} = A_{j:n,j} / \sqrt{A_{jj}}$
5: end for


The operation count for Algorithm 3 is $\tfrac{1}{3} n^3 + n^2 + \tfrac{8}{3} n - 2$ flops.
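Algorithm 3 can be sketched in MATLAB as follows. The test matrix is an assumption made here (a cross-product matrix, positive-definite with probability one), the factor is built on a copy G of A, and the built-in chol(A,'lower') is used only as a cross-check.

```matlab
n = 5;
X = randn(20,n);
A = X'*X;                      % assumed test matrix, symmetric positive-definite
G = A;                         % work on a copy; its lower triangle will hold G

G(:,1) = G(:,1)/sqrt(G(1,1));                          % Statement 1
for j = 2:n
    G(j:n,j) = G(j:n,j) - G(j:n,1:j-1)*G(j,1:j-1)';    % Statement 3
    G(j:n,j) = G(j:n,j)/sqrt(G(j,j));                  % Statement 4
end
G = tril(G);                   % discard the untouched upper triangle

disp(norm(G*G' - A))                   % reconstruction error
disp(norm(G - chol(A,'lower')))        % difference to MATLAB's own factor
```

Only the lower triangle is read and written inside the loop, which is why the upper triangle has to be cleared before G is used as a triangular factor.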
Example 3.1


For illustration purposes, we detail the operation count for Algorithm 3. We have n + 1 operations in Statement 1 (n divisions and one square root). For Statement 3, we count:

• 2(n - j + 1)(j - 1) operations for the product $A_{j:n,1:j-1}\, A'_{j,1:j-1}$
• n - j + 1 operations for the subtraction
• 2 operations for the computation of indices (j - 1)

Total for Statement 3 is 2(n - j + 1)(j - 1) + n - j + 1 + 2, which simplifies to (n - j + 1)(2(j - 1) + 1) + 2. Statement 4 necessitates (n - j + 1) + 1 operations (n - j + 1 divisions and one square root). Statements 3 and 4 together total (n - j + 1)(2(j - 1) + 2) + 3 operations. To compute the sum of these operations for j = 2, ..., n, we use MATLAB’s symbolic computation facilities. To the sum, we add the n + 1 operations occurring in Statement 1 and simplify the result.


syms n j
fljg1 = symsum((n-j+1)*(2*(j-1)+2)+3,j,2,n);
flops = simplify(fljg1 + n + 1)


flops = 8/3*n-2+1/3*n^3+n^2


3.1.4 QR decomposition 


If $A \in \mathbb{R}^{m \times n}$ is square or rectangular and has full column rank, we can compute the QR decomposition. The following scheme illustrates the decomposition for a rectangular matrix.

$$A = Q \begin{bmatrix} R \\ 0 \end{bmatrix} = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \begin{bmatrix} R \\ 0 \end{bmatrix}$$

Q is an orthogonal matrix, that is, Q'Q = I. In the case matrix A is square, $Q_2$ vanishes in the partition of Q. R is triangular. The QR decomposition is computationally more expensive; the highest term in the operation count is $4m^2 n$.

To solve a linear system Ax = b we again replace A by its decomposition, multiply the system from the left by Q', and solve the resulting triangular system by back-substitution.

Ax = b  becomes  QRx = b, and after multiplication by Q':
    Solve Rx = Q'b (back-substitution)

The MATLAB function qr computes the decomposition, and the solution of the triangular system is again coded with the backslash operator.
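A minimal sketch of this procedure for a rectangular matrix is shown below. The data are assumptions made for illustration (a random full-column-rank matrix with a consistent right-hand side), and the economy-size call qr(A,0), which returns only the n-by-n triangular block and the matching columns of Q, is a choice made here rather than something prescribed by the text.

```matlab
m = 8; n = 3;
A = randn(m,n);              % assumed test matrix, full column rank with probability one
xTrue = [1; 2; 3];
b = A*xTrue;                 % consistent right-hand side

[Q,R] = qr(A,0);             % economy size: Q is m-by-n, R is n-by-n upper triangular
x = R\(Q'*b);                % back-substitution via the backslash operator

disp(norm(x - xTrue))        % distance to the known solution
```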


3.1.5 Singular value decomposition 


The numerically most stable decomposition is the singular value decomposition (SVD), but it is also the most expensive computationally ($4m^2 n + 8mn^2 + 9n^3$ flops). Given a matrix $A \in \mathbb{R}^{m \times n}$ this decomposition computes two orthogonal matrices $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$, of left and right singular vectors, respectively, such that

$$A = U \Sigma V',$$

where $\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_p) \in \mathbb{R}^{m \times n}$ and p = min(m, n). The $\sigma_i$ are called the singular values, and they satisfy $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_p \ge 0$.

This decomposition is among others used to compute the pseudoinverse.⁴ An important use of SVD is the computation of the numerical rank of a matrix. This is accomplished by searching the number of singular values greater than a given tolerance. MATLAB computes the decomposition with the command [U,S,V] = svd(A). If the function is called with a single output argument, svd returns the vector of singular values. The MATLAB function eps with argument x computes the distance from x to the next larger floating point number.


s = svd(A);
tol = max(size(A)) * eps(s(1));
rank = sum(s > tol);


3.2 Iterative methods 


If we ignore rounding errors, a direct method produces an exact solution after a finite and known number of elementary operations. In contrast, iterative methods generate a sequence of approximations to the solution in an infinite number of operations. However, in practice, the approximation is satisfactory after a relatively small number of iterations.

Iterative methods simply proceed by matrix vector products and are, therefore, very easy to program. Given that the complexity of direct methods is $O(n^3)$, there is a limit for the size of manageable problems. However, large systems are generally sparse, that is, only a small fraction of the $n^2$ elements is nonzero. In such situations an iterative method can very easily exploit this sparse structure by simply taking into account the nonzero elements in the matrix vector product.⁵ However, this comes with the inconvenience that efficiency depends on the speed of convergence, which is generally not guaranteed.

If we use an iterative method to solve a sparse system, we have to verify the existence of a 
normalization of the equations. This corresponds to the existence of a permutation of the rows and 
columns of the matrix such that all elements on the diagonal are nonzero. The problem of finding a 
normalization in a sparse matrix is discussed in Section 3.3.3. 

Iterative methods can be partitioned into two classes, stationary methods relying on information remaining invariant from iteration to iteration and nonstationary methods relying on information from the previous iteration. Stationary iterative methods go back to Gauss, Liouville, and Jacobi whereas nonstationary methods, also often referred to as Krylov subspace methods, have been developed from around 1980. An overview can be found in Barrett et al. (1994) and Saad (1996). They are among the most efficient methods available for large⁶ sparse linear systems. The following presentation focuses on stationary iterative methods only.


4. See page 55. 

5. Nowadays, direct methods exist that efficiently exploit a sparse structure. However, their implementation is more com- 
plex. 

6. Allowing an efficient solution of a system of equations going into the order of millions of variables. 
3.2.1 Jacobi, Gauss-Seidel, and SOR


Given a linear system Ax = b, we choose a normalization,⁷ here the diagonal terms $a_{ii} x_i$,

$$a_{11} x_1 + a_{12} x_2 + a_{13} x_3 = b_1$$
$$a_{21} x_1 + a_{22} x_2 + a_{23} x_3 = b_2$$
$$a_{31} x_1 + a_{32} x_2 + a_{33} x_3 = b_3,$$

isolate the variables associated to the normalization and rewrite the system as

$$x_1 = (b_1 - a_{12} x_2 - a_{13} x_3) / a_{11}$$
$$x_2 = (b_2 - a_{21} x_1 - a_{23} x_3) / a_{22}$$
$$x_3 = (b_3 - a_{31} x_1 - a_{32} x_2) / a_{33}.$$

The variables $x_i$ on the right-hand-side are unknown, but we can consider that $x^{(k)}$ is an approximation for the solution of the system, which allows us to compute a new approximation in the following way:

$$x_1^{(k+1)} = \bigl( b_1 - a_{12} x_2^{(k)} - a_{13} x_3^{(k)} \bigr) / a_{11}$$
$$x_2^{(k+1)} = \bigl( b_2 - a_{21} x_1^{(k)} - a_{23} x_3^{(k)} \bigr) / a_{22}$$
$$x_3^{(k+1)} = \bigl( b_3 - a_{31} x_1^{(k)} - a_{32} x_2^{(k)} \bigr) / a_{33}.$$


This defines the Jacobi iteration that is formalized with the pseudocode given in Algorithm 4. 


Algorithm 4 Jacobi iteration.
1: give initial solution $x^{(0)} \in \mathbb{R}^n$
2: for k = 0, 1, 2, ... until convergence do
3:     for i = 1 : n do
4:         $x_i^{(k+1)} = \bigl( b_i - \sum_{j \neq i} a_{ij} x_j^{(k)} \bigr) / a_{ii}$
5:     end for
6: end for
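A possible MATLAB rendering of Algorithm 4 is sketched below. The test matrix (strictly diagonally dominant by construction), the tolerance, the iteration limit, and the simple relative-change stopping test are all assumptions made for this illustration.

```matlab
n = 5;
A = rand(n) + n*eye(n);      % assumed test matrix: strictly diagonally dominant
xTrue = ones(n,1);
b = A*xTrue;
tol = 1e-10; maxit = 500;    % hypothetical tolerance and iteration limit

x0 = zeros(n,1);             % previous iterate x^(k)
x1 = x0;                     % new iterate x^(k+1)
for k = 1:maxit
    for i = 1:n
        j = [1:i-1, i+1:n];                        % all indices except i
        x1(i) = (b(i) - A(i,j)*x0(j)) / A(i,i);    % uses only old values
    end
    if all(abs(x1 - x0) ./ (abs(x0) + 1) < tol), break, end
    x0 = x1;
end
fprintf('Jacobi: %d iterations, error %8.2e\n', k, norm(x1 - xTrue));
```

Two arrays are needed because every component of the new iterate is computed from the previous iterate only.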


If we use the computed approximation on the left-hand-side as soon as it is available in the succeeding equations, we have the Gauss-Seidel iteration

$$x_1^{(k+1)} = \bigl( b_1 - a_{12} x_2^{(k)} - a_{13} x_3^{(k)} \bigr) / a_{11}$$
$$x_2^{(k+1)} = \bigl( b_2 - a_{21} x_1^{(k+1)} - a_{23} x_3^{(k)} \bigr) / a_{22}$$
$$x_3^{(k+1)} = \bigl( b_3 - a_{31} x_1^{(k+1)} - a_{32} x_2^{(k+1)} \bigr) / a_{33}.$$

With the Gauss-Seidel iteration, we do not need to store the past and current solution vector as we can successively overwrite the past solution vector x with the current solution. The corresponding pseudocode is given in Algorithm 5 where we also sketch how the solution vector is successively overwritten. For j < i, the elements in array x have already been overwritten by the newly computed $x_j^{(k+1)}$ and for j > i, the elements correspond to the $x_j^{(k)}$ of the previous iteration.


7. For large and sparse matrices with irregular structure, it might not be trivial to find a normalization. For a discussion see 
Gilli and Garbely (1996). 
Algorithm 5 Gauss-Seidel iteration.
1: give initial solution $x \in \mathbb{R}^n$
2: while not converged do
3:     for i = 1 : n do
4:         $x_i = \bigl( b_i - \sum_{j \neq i} a_{ij} x_j \bigr) / a_{ii}$
5:     end for
6: end while

Sketch of array x at step i: elements 1, ..., i-1 hold $x^{(k+1)}$, element i is being updated, and elements i+1, ..., n hold $x^{(k)}$.
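Algorithm 5 can be sketched in MATLAB with a single solution array. The test matrix, tolerance, and iteration limit are assumptions made for illustration; the copy xOld is kept only for the stopping test and plays no role in the update itself.

```matlab
n = 5;
A = rand(n) + n*eye(n);      % assumed test matrix: strictly diagonally dominant
xTrue = ones(n,1);
b = A*xTrue;
tol = 1e-10; maxit = 500;    % hypothetical tolerance and iteration limit

x = zeros(n,1);
for nit = 1:maxit
    xOld = x;                % used only to measure the change
    for i = 1:n
        j = [1:i-1, i+1:n];
        % x(j) holds new values for j < i and old values for j > i
        x(i) = (b(i) - A(i,j)*x(j)) / A(i,i);
    end
    if all(abs(x - xOld) ./ (abs(xOld) + 1) < tol), break, end
end
fprintf('Gauss-Seidel: %d iterations, error %8.2e\n', nit, norm(x - xTrue));
```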


Successive overrelaxation 


The convergence of the Gauss-Seidel method can be improved by modifying Statement 4 in the Gauss-Seidel algorithm in the following way

$$x_i^{(k+1)} = \omega\, x_{GS,i}^{(k+1)} + (1 - \omega)\, x_i^{(k)}, \tag{3.1}$$

where $x_{GS,i}^{(k+1)}$ is the value of $x_i^{(k+1)}$ computed with a Gauss-Seidel step and $\omega$ is the relaxation parameter. Hence, successive overrelaxation (SOR) is a linear combination of a Gauss-Seidel step and the past SOR step.

We will now show how the SOR iteration (3.1) can be simplified. Indeed, it is not necessary to explicitly compute the Gauss-Seidel solution $x_{GS,i}^{(k+1)}$. We recall the generic Gauss-Seidel iteration

$$x_{GS,i}^{(k+1)} = \Bigl( b_i - \sum_{j=1}^{i-1} a_{ij} x_j^{(k+1)} - \sum_{j=i+1}^{n} a_{ij} x_j^{(k)} \Bigr) / a_{ii} \tag{3.2}$$

and the fact that an array x can store the relevant elements of $x^{(k)}$ and $x^{(k+1)}$ simultaneously. We visualize this in the following scheme for the ith component of $x^{(k+1)}$.

[Scheme: $b_i$, row $A_{i,1:n}$, and the array x holding $x_{1:i-1}^{(k+1)}$ followed by $x_{i+1:n}^{(k)}$.]

This is the reason why in Statement 4 of Algorithm 5 we drop the upper indices and merge both arrays into a single array and write

$$x_i = \Bigl( b_i - \sum_{j \neq i} a_{ij} x_j \Bigr) / a_{ii}.$$

Expression (3.1) can be rewritten as

$$x_i^{(k+1)} = \omega \bigl( x_{GS,i}^{(k+1)} - x_i^{(k)} \bigr) + x_i^{(k)}. \tag{3.3}$$

We consider the expression $\bigl( x_{GS,i}^{(k+1)} - x_i^{(k)} \bigr)$, where we replace $x_{GS,i}^{(k+1)}$ with the expression given in (3.2) and multiply the term $x_i^{(k)}$ on the right by $a_{ii}/a_{ii}$.

$$x_{GS,i}^{(k+1)} - x_i^{(k)} = \frac{b_i - \sum_{j=1}^{i-1} a_{ij} x_j^{(k+1)} - \sum_{j=i+1}^{n} a_{ij} x_j^{(k)}}{a_{ii}} - \frac{a_{ii} x_i^{(k)}}{a_{ii}}$$

Including $a_{ii} x_i^{(k)}$ in the second sum, we get

$$x_{GS,i}^{(k+1)} - x_i^{(k)} = \Bigl( b_i - \sum_{j=1}^{i-1} a_{ij} x_j^{(k+1)} - \sum_{j=i}^{n} a_{ij} x_j^{(k)} \Bigr) / a_{ii}$$

and given that $x^{(k)}$ and $x^{(k+1)}$ can be stored in the same array, (3.3), which is equivalent to (3.1), can be written as

$$x_i = \omega\, (b_i - A_{i,1:n}\, x) / a_{ii} + x_i,$$

given that x is initialized with $x^{(k)}$. This is then the generic iteration of the successive overrelaxation method (SOR), which is summarized in Algorithm 6.


Algorithm 6 SOR method.
1: give starting solution $x \in \mathbb{R}^n$
2: while not converged do
3:     for i = 1 : n do
4:         $x_i = \omega\, (b_i - A_{i,1:n}\, x) / a_{ii} + x_i$
5:     end for
6: end while


The relaxation parameter $\omega$ is defined in the interval $0 < \omega < 2$.⁸ We see that SOR is a generalization in which, for $\omega = 1$, we have the Gauss-Seidel method. If $\omega < 1$ we have the damped Gauss-Seidel method which can be used to help establish convergence for diverging iterations. For $\omega > 1$, we have a speedup of convergence.
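The SOR update of Algorithm 6 can be sketched as follows. This is not the book's SOR.m; the test matrix (symmetric positive-definite by construction), the value omega = 1.2, the tolerance, and the iteration limit are assumptions chosen for illustration.

```matlab
n = 5;
X = randn(50,n);
A = X'*X + n*eye(n);         % assumed test matrix: symmetric positive-definite
xTrue = ones(n,1);
b = A*xTrue;
omega = 1.2;                 % hypothetical relaxation parameter, 0 < omega < 2
tol = 1e-10; maxit = 500;    % hypothetical tolerance and iteration limit

x = zeros(n,1);
for nit = 1:maxit
    xOld = x;                % used only to measure the change
    for i = 1:n
        % A(i,:)*x mixes new values (j < i) and old values (j >= i)
        x(i) = omega*(b(i) - A(i,:)*x)/A(i,i) + x(i);
    end
    if all(abs(x - xOld) ./ (abs(xOld) + 1) < tol), break, end
end
fprintf('SOR (omega = %.1f): %d iterations, error %8.2e\n', omega, nit, norm(x - xTrue));
```

Setting omega = 1 in this sketch reduces the update to a Gauss-Seidel step.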

Illustrations of the application of the SOR method are given in the Example on page 42 and for 
the pricing of American put options on page 79. 


3.2.2 Convergence of iterative methods 


The choice of the normalization of the equations influences the convergence behavior of all methods. The order in which the equations are iterated is of importance only for Gauss-Seidel and SOR. For a formal analysis of the convergence, we decompose the coefficient matrix A into the following sum

$$A = L + D + U,$$

where L is lower triangular, D is diagonal, and U is upper triangular. We then generalize the iteration equations as

$$M x^{(k+1)} = N x^{(k)} + b.$$

For the different methods, the matrices M and N are

• M = D and N = -(L + U) for Jacobi
• M = D + L and N = -U for Gauss-Seidel
• $M = D + \omega L$ and $N = (1 - \omega) D - \omega U$ for SOR (with b replaced by $\omega b$)

An iteration scheme then converges if the spectral radius $\rho$ of the matrix $M^{-1}N$ is smaller than 1. It can easily be shown that in the case of the SOR method, the spectral radius of $M^{-1}N$ is smaller than one only if $0 < \omega < 2$.

To derive this result, we consider that according to the definition of M and N, we have A = M - N (for SOR, $\omega A = M - N$, which leaves the argument unchanged) and substituting M - N in Ax = b gives Mx = Nx + b. Denoting the error at iteration k by $e^{(k)} = x^{(k)} - x$, we write

$$Mx = Nx + b = N \bigl( x^{(k)} - e^{(k)} \bigr) + b = \underbrace{N x^{(k)} + b}_{M x^{(k+1)}} - N e^{(k)}$$

$$M \underbrace{\bigl( x^{(k+1)} - x \bigr)}_{e^{(k+1)}} = N e^{(k)}$$

$$e^{(k+1)} = M^{-1} N e^{(k)} = (M^{-1} N)^{k+1} e^{(0)},$$

and we know that $(M^{-1}N)^k \to 0$ if and only if $\rho(M^{-1}N) < 1$.
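The spectral radius of the iteration matrix can be computed directly for small problems. The sketch below does so for the three splittings; the test matrix and the value omega = 1.2 are assumptions made for illustration.

```matlab
n = 5;
X = randn(50,n);
A = X'*X + n*eye(n);         % assumed test matrix: symmetric positive-definite
D = diag(diag(A)); L = tril(A,-1); U = triu(A,1);
omega = 1.2;                 % hypothetical relaxation parameter

% Spectral radius of inv(M)*N for each splitting, computed without forming inv(M)
rhoJ   = max(abs(eig(D \ (-(L + U)))));
rhoGS  = max(abs(eig((D + L) \ (-U))));
rhoSOR = max(abs(eig((D + omega*L) \ ((1 - omega)*D - omega*U))));

fprintf('rho Jacobi = %.4f, rho Gauss-Seidel = %.4f, rho SOR = %.4f\n', ...
        rhoJ, rhoGS, rhoSOR);
```

For a symmetric positive-definite matrix the Gauss-Seidel and SOR values should lie below one; the Jacobi value carries no such guarantee in general.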


8. An example about how to explore the optimal value is given on page 42. 
Often, this condition is not practical as it involves much more computation than the original problem of solving the linear system. However, in situations where the linear system has to be solved repeatedly, it is worthwhile to search the optimal value for the relaxation parameter $\omega$. Such a search is illustrated in the example given on page 42.

There are a few cases in which convergence is guaranteed. For Jacobi, diagonal dominance⁹ is a sufficient condition to converge. For Gauss-Seidel, we can show that the iterations always converge if matrix A is symmetric positive-definite.




3.2.3 General structure of algorithms for iterative methods 


Iterative methods converge after an infinite number of iterations, and therefore, we must provide a stopping criterion. In general, iterations will be stopped if the changes in the solution vector, that is, the error, become sufficiently small. As already suggested in Section 2.2, we choose a combination between absolute and relative error, that is, we stop the iterations if the following condition is satisfied

$$\frac{\bigl| x_i^{(k+1)} - x_i^{(k)} \bigr|}{\bigl| x_i^{(k)} \bigr| + 1} < \varepsilon, \qquad i = 1, 2, \ldots, n,$$

where $\varepsilon$ is a given tolerance. An implementation of this condition with MATLAB is given with the function converged. Moreover, it is good practice to prevent the algorithm from exceeding a given maximum number of iterations. This is done in Statement 5 in Algorithm 7.


Algorithm 7 General structure of iterative methods.
1: initialize $x^{(1)}$, $x^{(0)}$, $\varepsilon$ and maxit
2: while not(converged($x^{(0)}$, $x^{(1)}$, $\varepsilon$)) do
3:     $x^{(0)} = x^{(1)}$    # Store previous iteration in $x^{(0)}$
4:     compute $x^{(1)}$ with Jacobi, Gauss-Seidel, or SOR
5:     stop if number of iterations exceeds maxit
6: end while


Listing 3.1: C-LinEqsLSP/M/./Ch03/converged.m 


function res = converged(x0,x1,tol) 


% converged.m -- version 2007-08-10 
res = all( abs(x1-x0) ./ (abs(x0) + 1) < tol ); 




Note that, as mentioned earlier, the convergence of Jacobi iterations is sensitive to the normal- 
ization of the equations, whereas the convergence of Gauss-Seidel iterations is sensitive to the 
normalization and the ordering of the equations. In general, no practical method for checking the 
convergence of either method exists. 


9. A matrix $A \in \mathbb{R}^{n \times n}$ is strictly diagonal dominant if $|a_{ii}| > \sum_{j \neq i} |a_{ij}|$ for i = 1, ..., n, hence satisfying

$$\rho(M^{-1}N) \le \| D^{-1}(L + U) \|_\infty = \max_{1 \le i \le n} \sum_{\substack{j=1 \\ j \neq i}}^{n} \left| \frac{a_{ij}}{a_{ii}} \right| < 1.$$


Example 3.2


We give an example about how to solve a linear system with the SOR method. We set a seed for the random generator and generate a matrix X from which we compute A which is positive-definite by construction. The optimal value for the relaxation parameter $\omega$ is computed with the MATLAB code OmegaGridSearch.m. First, the SOR method, given in the code SOR.m, is executed for $\omega = 1$, which corresponds to a Gauss-Seidel iteration. Second, the optimal value for $\omega$ is used in the input for the SOR. The number of iterations for each execution is printed, and we observe the faster convergence when we use the optimal value for SOR.

Fig. 3.2 shows the graphic produced by the function OmegaGridSearch.m and we observe that $\omega \approx 1.17$ minimizes the spectral radius of $M^{-1}N$, which governs the convergence of the SOR method.

FIGURE 3.2 Grid search for optimal $\omega$.


n = 100; m = 30; X = [ones(n,1) randn(n,m-1)]; A = X'*X;
x = ones(m,1); b = A*x; x1 = 2*x; maxit = 80; tol = 1e-4;
wopt = OmegaGridSearch(A);
omega = 1;
[sol1,nit1] = SOR(x1,A,b,omega,tol,maxit);
omega = wopt;
[sol2,nit2] = SOR(x1,A,b,omega,tol,maxit);
fprintf('\n==> nit_1 = %i  nit_2 = %i\n',nit1,nit2);


==> nit_1 = 14  nit_2 = 10


Listing 3.2: C-LinEqsLSP/M/./Ch03/OmegaGridSearch.m 


function [w,r] = OmegaGridSearch(A,winf,wsup,npoints)
% OmegaGridSearch.m -- version 2010-10-28
% Grid search for optimal relaxation parameter
if nargin == 1, npoints = 21; winf = .9; wsup = 1.8; end
D = diag(diag(A)); L = tril(A,-1); U = triu(A,1);
omegavec = linspace(winf,wsup,npoints);
for k = 1:npoints
    w = omegavec(k);
    M = D + w*L;
    N = (1-w)*D - w*U;
    v = eig(inv(M)*N);
    s(k) = max(abs(v));
end
[smin,i] = min(s);
w = omegavec(i);
plot(omegavec,s,'o','MarkerSize',5,'Color',[.5 .5 .5])
set(gca,'xtick',[winf 1 w wsup]);
ylabel('\rho','Rotation',0); xlabel('\omega')
if nargout == 2, r = smin; end


Listing 3.3: C-LinEqsLSP/M/./Ch03/SOR.m 


function [x1,nit] = SOR(x1,A,b,omega,tol,maxit)
% SOR.m -- version 2010-10-25
% SOR for Ax = b
if nargin == 3, tol=1e-4; omega=1.2; maxit=70; end
it = 0; n = length(x1); x0 = -x1;
while ~converged(x0,x1,tol)
    x0 = x1;
    for i = 1:n
        x1(i) = omega*( b(i)-A(i,:)*x1 ) / A(i,i) + x1(i);
    end
    it=it+1; if it>maxit, error('Maxit in SOR'), end
end
if nargout == 2, nit = it; end


3.2.4 Block iterative methods 


Certain sparse systems of equations can be rearranged such that the coefficient matrix A has an
almost block diagonal structure, that is, with relatively dense matrices on the diagonal and very
sparse matrices off-diagonal. In economic analysis, this is the case for multi-country
macroeconomic models in which the different country models are linked together by a relatively
small number of trade relations.

A block iterative method is then a technique where one iterates over the subsystems defined by
the block diagonal decomposition. The technique used to solve the subsystems can be chosen freely
and is not relevant for the discussion.

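Let us consider a linear system Ax = b and a partition of A in the form

$$ A = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1N} \\ A_{21} & A_{22} & \cdots & A_{2N} \\ \vdots & \vdots & & \vdots \\ A_{N1} & A_{N2} & \cdots & A_{NN} \end{bmatrix} , $$

where the diagonal blocks $A_{ii}$, i = 1, ..., N are square. Writing the system Ax = b under the
same partitioned form, we have

$$ \begin{bmatrix} A_{11} & \cdots & A_{1N} \\ \vdots & & \vdots \\ A_{N1} & \cdots & A_{NN} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix} = \begin{bmatrix} b_1 \\ \vdots \\ b_N \end{bmatrix} \qquad\text{or}\qquad \sum_{j=1}^{N} A_{ij} x_j = b_i , \quad i = 1, \ldots, N. $$

If the matrices $A_{ii}$, i = 1, ..., N are nonsingular, Algorithm 8 can be applied. As mentioned
earlier, the linear system in Statement 4 can be solved with any method.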


Algorithm 8 Block Jacobi method.

1: give initial solution $x^{(0)} \in \mathbb{R}^n$
2: for k = 0, 1, 2, ... until convergence do
3:     for i = 1 : N do
4:         solve $A_{ii} x_i^{(k+1)} = b_i - \sum_{j=1,\, j \neq i}^{N} A_{ij} x_j^{(k)}$
5:     end for
6: end for


Modifying Statement 4 in Algorithm 8 in the following way,

$$ \text{solve} \quad A_{ii} x_i^{(k+1)} = b_i - \sum_{j=1}^{i-1} A_{ij} x_j^{(k+1)} - \sum_{j=i+1}^{N} A_{ij} x_j^{(k)} $$

leads to the block Gauss-Seidel method.
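As an illustration of Algorithm 8, the following sketch applies block Jacobi to a small system with
N = 2 blocks. The matrix, the partition stored in the cell array blocks, and the tolerances are
assumptions chosen for this example; the subsystems are solved with the backslash operator, but any
method could be used.

```matlab
% Sketch of block Jacobi (Algorithm 8) for N = 2 blocks (illustrative data).
A = [10 2 1 0;
      1 8 0 1;
      0 1 9 2;
      1 0 3 7];                      % strictly diagonally dominant test matrix (assumed)
xtrue = [1; -1; 2; 0.5];  b = A*xtrue;
blocks = {1:2, 3:4};                 % index sets defining the partition
x = zeros(4,1);  tol = 1e-10;  maxit = 500;
for k = 1:maxit
    xold = x;
    for i = 1:numel(blocks)
        I = blocks{i};
        J = setdiff(1:4, I);         % indices outside block i
        % Statement 4: the right-hand side uses only the previous iterate xold
        x(I) = A(I,I) \ ( b(I) - A(I,J)*xold(J) );
    end
    if all( abs(x-xold) ./ (abs(xold) + 1) < tol ), break, end
end
fprintf('block Jacobi stopped after %i iterations\n', k);
```

Replacing xold(J) by x(J) in the update turns the sketch into block Gauss-Seidel, since blocks that
have already been updated in the current sweep are then used immediately.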


There exist many variants of block methods according to the choice of the decomposition and
the way the block is solved. A particular variant consists of choosing N = 2 with $A_{11}$ being a
lower triangular matrix, that is, easy to solve, and $A_{22}$ arbitrary but of significantly
smaller size, which will be solved with a direct method.

$$ A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} $$

In such a situation, the variables in $A_{22}$ are called minimal feedback vertex set.¹⁰ If the
systems in Statement 4 are solved only approximately, the method is called incomplete inner loops.

As explained earlier for the iterative methods, the convergence of the block iterative methods
depends on the spectral radius of matrix $M^{-1}N$.


3.3 Sparse linear systems 


Any matrix with a sufficient number of zero elements such that there is a gain from avoiding redun- 
dant operations and storage is a sparse matrix. Very large but sparse linear systems arise in many 
practical problems in economics and finance. This is, for instance, the case when we discretize 
partial differential equations. These problems cannot be solved without resorting to appropriate 
algorithms. 


3.3.1 Tridiagonal systems 


A simple case of a sparse system is a linear system with a banded matrix of bandwidth bw = 1
($a_{ij} = 0$ if $|i - j| > bw$). Such a matrix is termed tridiagonal, and its storage requires only
three arrays, one of length n and two of length n − 1. The LU factorization of a tridiagonal matrix
A can be readily derived by sequentially developing the product LU = A, shown here for n = 5:

$$ \underbrace{\begin{bmatrix} 1 & & & & \\ \ell_1 & 1 & & & \\ & \ell_2 & 1 & & \\ & & \ell_3 & 1 & \\ & & & \ell_4 & 1 \end{bmatrix}}_{L} \underbrace{\begin{bmatrix} u_1 & r_1 & & & \\ & u_2 & r_2 & & \\ & & u_3 & r_3 & \\ & & & u_4 & r_4 \\ & & & & u_5 \end{bmatrix}}_{U} = \underbrace{\begin{bmatrix} d_1 & q_1 & & & \\ p_1 & d_2 & q_2 & & \\ & p_2 & d_3 & q_3 & \\ & & p_3 & d_4 & q_4 \\ & & & p_4 & d_5 \end{bmatrix}}_{A} $$

Algorithm 9 details the factorization of a tridiagonal matrix A with $p_i$, i = 1, ..., n − 1 for
the lower diagonal, $d_i$, i = 1, ..., n for the main diagonal, and $q_i$, i = 1, ..., n − 1 for the
upper diagonal. We verify that $r_i = q_i$, i = 1, ..., n − 1 and only the lower diagonal of L and
the main diagonal of U have to be computed. The operation count for Algorithm 9 is 4n − 4 flops.


Algorithm 9 Factorization of tridiagonal matrix.

1: $u_1 = d_1$
2: for i = 2 : n do
3:     $\ell_{i-1} = p_{i-1} / u_{i-1}$
4:     $u_i = d_i - \ell_{i-1} q_{i-1}$
5: end for


Forward and back-substitution for these particular triangular systems are given in Algorithm 10. 
Given the lower diagonal £, the diagonal u, and the upper diagonal q, the algorithm overwrites the 
right-hand-side vector b with the solution. The operation count is 7n — 5 flops. 


10. See Gilli (1992) for more details. 


Algorithm 10 Forward and back-substitution for tridiagonal system.

1: give $\ell$, u, q and b
2: for i = 2 : n do
3:     $b_i = b_i - \ell_{i-1} b_{i-1}$
4: end for
5: $b_n = b_n / u_n$
6: for i = n − 1 : −1 : 1 do
7:     $b_i = (b_i - q_i b_{i+1}) / u_i$
8: end for

Example 3.3 


We generate a tridiagonal linear system of size n = 2 × 10^6 and solve it using first MATLAB's
built-in sparse matrix facilities, and then using the sparse code lu3diag.m and solve3diag.m.
We observe that the suggested implementation executes faster because it exploits the particular
tridiagonal structure.


n = 2000000;
c = (1:n+1) / 3; d = ones(n,1); x = ones(n,1);
p = -c(2:n); q = c(1:n-1);
A = spdiags([[p NaN]' d [NaN q]'],-1:1,n,n);
b = A*x;
t0 = tic;
[L,U] = lu(A);
s1 = U\(L\b);
fprintf('\n Sparse Matlab %5.3f (sec)',toc(t0));
t0 = tic;
[l,u] = lu3diag(p,d,q);
s2 = solve3diag(l,u,q,b);
fprintf('\n Sparse code %5.3f (sec) \n',toc(t0));


 Sparse Matlab 0.536 (sec)
 Sparse code 0.114 (sec)


Listing 3.4: C-LinEqsLSP/M/./Ch03/lu3diag.m 


function [l,u] = lu3diag(p,d,q)
% lu3diag.m -- version 2018-11-30
n = length(d); l = zeros(1,n-1); u = l;
u(1) = d(1);
for i = 2:n
    l(i-1) = p(i-1)/u(i-1);
    u(i) = d(i) - l(i-1)*q(i-1);
end


Listing 3.5: C-LinEqsLSP/M/./Ch03/solve3diag.m


function b = solve3diag(l,u,q,b)
% solve3diag.m -- version 1999-05-11
% Back and forward substitution for tridiagonal system
n = length(b);
for i = 2:n
    b(i) = b(i) - l(i-1) * b(i-1);
end
b(n) = b(n) / u(n);
for i = n-1:-1:1
    b(i) = ( b(i) - q(i) * b(i+1) ) / u(i);
end
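
The two steps can be checked on a system small enough to inspect by hand. The sketch below writes
Algorithms 9 and 10 inline, so it does not depend on the files above, and compares the result with
the backslash operator. The diagonals and the right-hand side are illustrative assumptions.

```matlab
% Sketch: Algorithms 9 and 10 written inline for a small system (illustrative data).
n = 6;
p = -ones(1,n-1);  d = 4*ones(1,n);  q = -ones(1,n-1);   % lower, main, upper diagonals
A = diag(d) + diag(p,-1) + diag(q,1);                    % dense copy, for checking only
xtrue = (1:n)';  b = A*xtrue;
% Algorithm 9: factorization
l = zeros(1,n-1);  u = zeros(1,n);  u(1) = d(1);
for i = 2:n
    l(i-1) = p(i-1) / u(i-1);
    u(i)   = d(i) - l(i-1)*q(i-1);
end
% Algorithm 10: forward and back-substitution on a copy of the right-hand side
y = b;
for i = 2:n
    y(i) = y(i) - l(i-1)*y(i-1);
end
y(n) = y(n) / u(n);
for i = n-1:-1:1
    y(i) = ( y(i) - q(i)*y(i+1) ) / u(i);
end
fprintf('max deviation from A\\b: %g\n', max(abs(y - A\b)));
```

No pivoting is performed, so the sketch assumes that no element u(i) becomes zero, which holds for
the diagonally dominant data used here.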
3.3.2 Irregular sparse matrices


There exist different storage schemes, the choice of which is problem dependent. MATLAB stores
the nonzero elements of a matrix column-wise. For a matrix A with n = 3 columns and containing
nnz = 4 nonzero elements, MATLAB uses the following representation (positions counted from 1):

        0 1 0
    A = 2 0 3
        4 0 0

    Column pointers:   1  3  4  5     (n + 1 entries)
    Row index:         2  3  1  2     (nnz entries)
    Elements:          2  4  1  3     (nnz entries)

This storage scheme needs n + 1 integers for the column pointers, nnz integers for the row
indices, and nnz real numbers for the matrix elements.

An integer number is represented with 4 bytes and a real number with 8 bytes. The total number
of bytes to store a matrix with n columns and nnz nonzero elements is then 12 nnz + 4n + 4 bytes.¹¹
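
The three arrays can be rebuilt from a full matrix with a few lines of MATLAB, which also makes
the byte count concrete. This is a sketch for illustration only: it mimics the scheme described
above with 1-based positions and is not a view into MATLAB's internal storage. The variable names
colptr, rowidx, and elems are assumptions.

```matlab
% Sketch: build the three arrays of the column-wise scheme for the example matrix.
A = [0 1 0; 2 0 3; 4 0 0];
[rowidx, colidx, elems] = find(A);        % find returns the nonzeros column by column
n  = size(A,2);  nz = numel(elems);
counts = accumarray(colidx, 1, [n 1]);    % number of nonzeros in each column
colptr = cumsum([1; counts]);             % start of each column, plus one end marker
bytes  = 12*nz + 4*n + 4;                 % 8 + 4 bytes per nonzero, 4 per column pointer
disp(colptr');  disp(rowidx');  disp(elems');
fprintf('storage: %i bytes\n', bytes);
```

For the example matrix the formula gives 12 · 4 + 4 · 3 + 4 = 64 bytes, compared with 72 bytes for
the nine elements of the full matrix.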


Sparse matrices in MATLAB 


The conversion to a sparse matrix is not automatic in MATLAB. The user must execute an explicit 
command. However, once initialized, the sparse storage is propagated, that is, an operation with 
a sparse matrix produces a sparse result, except for addition and subtraction. The conversion is 
obtained with the functions sparse and full. 


A = [0 1 0; 2 0 3; 4 0 0];
B = sparse(A)
C = full(B)
B =
   (2,1)        2
   (3,1)        4
   (1,2)        1
   (2,3)        3
C =
     0     1     0
     2     0     3
     4     0     0


It would not be practical to create a sparse matrix by converting a full matrix. Therefore, it
is possible to create a sparse matrix directly by specifying the list of nonzero elements and their
corresponding indices. The command


S = sparse(i,j,e,m,n,nzmax);


creates a sparse matrix S of size m × n with S(i(k), j(k)) = e(k) and with preallocated memory for
nzmax elements. The preallocation of memory is a good practice to speed up the code but is not
mandatory. In the example below, we code the 3 × 3 matrix A given above without preallocating
memory.


i = [2 3 1 2]; j = [1 1 2 3]; e = [2 4 1 3];
S = sparse(i,j,e,3,3);
[i,j,e] = find(S)
i =
     2
     3
     1
     2
j =
     1
     1
     2
     3
e =
     2
     4
     1
     3


11. MATLAB's whos function might return a larger number due to automatic preallocation of storage space.


Note that if a matrix element appears more than once in the list, it is not overwritten, but the 
elements are summed. 
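
A two-line check makes the summation rule visible. The index and value vectors below are
illustrative assumptions in which element (1,1) is listed twice.

```matlab
% Sketch: repeated (row, column) pairs are accumulated, not overwritten.
i = [1 1 2];  j = [1 1 2];  e = [2 3 4];   % element (1,1) appears twice (illustrative)
S = sparse(i, j, e, 2, 2);
disp(full(S))                              % element (1,1) holds 2 + 3
```

This behavior is convenient when a matrix is assembled from contributions that overlap, but it
also means that accidental duplicates in the index lists go unnoticed.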

Particular sparse matrices can be created with the commands speye, spdiags, sprandn, 
etc. The command help sparfun lists the functions which are specific to sparse matrices. 


3.3.3 Structural properties of sparse matrices 


For sparse matrices we can investigate certain properties depending only on the sparsity structure,
that is, the information about what elements are different from zero regardless of their numerical
value. This analysis is particularly relevant for irregular sparse matrices where zero elements can
be present on the main diagonal. In order to analyze these properties, it is convenient to associate
an incidence matrix to the matrix under consideration. The incidence matrix M of a matrix A is
defined as

$$ m_{ij} = \begin{cases} 1 & \text{if } a_{ij} \neq 0 \\ 0 & \text{otherwise.} \end{cases} $$

In the following, the structural properties are investigated by analyzing the incidence matrix.


Structural rank 


Recall that the determinant of a square matrix $M \in \mathbb{R}^{n \times n}$ can be formalized as

$$ \det M = \sum_{p \in P} \operatorname{sign}(p)\, m_{1p_1} m_{2p_2} \cdots m_{np_n} , \tag{3.4} $$

where P is the set of n! permutations of the elements of the set {1, 2, ..., n} and sign(p) is a
function taking the values +1 and −1. From Eq. (3.4) we readily conclude that a necessary condition
for the determinant being nonzero is the existence of at least one nonzero product in the
summation, which then corresponds to the existence of n nonzero elements in M, each row and column
containing exactly one of these elements. This is equivalent to the existence of a permutation of
the columns of M such that all elements on the main diagonal are nonzero. Such a set of elements is
called a normalization of the equations or a matching. For more details see Gilli (1992) or Gilli
and Garbely (1996). So, the existence of a normalization is a necessary condition for a matrix
being nonsingular.

The MATLAB function dmperm¹² searches a matching W of maximal cardinality. The command
p = dmperm(M) returns an array p defined as $p_i = j$ if $m_{ji} \in W$ and 0 elsewhere. For
a complete matching, that is, of cardinality n, the permutation M(p, :) produces a matrix with
nonzero elements on the diagonal.

The maximum cardinality of a matching is called the structural rank of a matrix and corresponds
to the number of nonzero elements returned by dmperm. The MATLAB function sprank
computes this number as


r = sum(dmperm(M) > 0);
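
That a complete matching is necessary but not sufficient for nonsingularity can be seen from a
small example. The matrix below is an assumption chosen for illustration: all of its elements are
nonzero, so a complete matching exists, yet its rows are identical.

```matlab
% Sketch: full structural rank is necessary, not sufficient, for nonsingularity.
M  = sparse([1 1; 1 1]);             % illustrative matrix: every element is nonzero
rs = sum(dmperm(M) > 0);             % structural rank as defined in the text
rn = rank(full(M));                  % numerical rank
fprintf('structural rank %i, numerical rank %i\n', rs, rn);
```

The structural rank only bounds the numerical rank from above; the actual values of the elements
decide whether the bound is attained.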


12. This function implements the algorithm by Dulmage and Mendelsohn (1963). 


FIGURE 3.3 Example of a decomposable matrix (left) and its block triangular form (right). 


Block triangular decomposition 


A sparse matrix might be decomposable, that is, there exists a permutation of the rows and columns
such that the matrix has a block triangular shape with each matrix on the diagonal being square
and indecomposable. If such a decomposition does not exist, the matrix is indecomposable. For a
decomposable matrix, we can solve the subsystems recursively. In Fig. 3.3, we present a small
example of a decomposable matrix. The MATLAB command dmperm computes the row permutation
vector p and the column permutation vector q. The vectors r and s contain the row and column
pointers for the indecomposable square matrices on the diagonal. The pointers correspond to the
starting row/column of a block. So $r_1 = 1$ indicates that the first block starts at row 1 (same
for the column) and $r_2 = 2$ indicates that block 2 starts at row/column 2, hence the first block
consists of a single element. This is also the case for blocks 2 to 4. Block 5 then starts at
row/column 5 ($r_5 = 5$) and the succeeding block at row/column 9 ($r_6 = 9$), which defines a
block of order 4, and so on.


[p,q,r,s] = dmperm(A)
p =
    11     9     2     5    12    13    10     1     7     6     3     4     8
q =
     1    13     2     8     3     4    11    12     6     7     9     5    10
r =
     1     2     3     4     5     9    12    14
s =
     1     2     3     4     5     9    12    14
spy(A(p,q))


The MATLAB command spy produces a graph of the incidence matrix. In Fig. 3.3, rows and
columns are labeled with the vectors p and q, respectively. In the right panel of the figure,
blocks are framed, but the frames are not produced by spy.
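
Solving the subsystems recursively can be sketched as follows. The example assumes a square matrix
with full structural rank and nonsingular diagonal blocks, and it relies on dmperm returning a
block upper triangular form A(p,q), so that the last block is solved first. The matrix and the
right-hand side are illustrative assumptions.

```matlab
% Sketch: solve A*x = b block by block after a dmperm permutation (illustrative data).
A = sparse([2 0 0 1;
            1 3 0 0;
            0 0 4 0;
            0 0 1 5]);
n = size(A,1);  xtrue = (1:n)';  b = A*xtrue;
[p,q,r,s] = dmperm(A);               % A(p,q) is block upper triangular
B = A(p,q);  c = b(p);
y = zeros(n,1);                      % y = x(q), the solution in permuted ordering
for k = numel(r)-1:-1:1              % last block first (block back-substitution)
    I  = r(k):r(k+1)-1;              % rows of block k
    Jd = s(k):s(k+1)-1;              % columns of block k
    Jr = s(k+1):n;                   % columns of the blocks already solved
    y(Jd) = B(I,Jd) \ ( c(I) - B(I,Jr)*y(Jr) );
end
x = zeros(n,1);  x(q) = y;           % undo the column permutation
fprintf('max error: %g\n', max(abs(x - xtrue)));
```

Each pass of the loop solves a system whose order is that of a single diagonal block, which is
where the gain over treating the full matrix at once comes from.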


Structurally singular matrices 


If a sparse matrix of order n is structurally singular, that is, with structural rank k < n, it
might be of relevance to know where new nonzero elements have to be introduced in order to increase
its rank. It can be shown (Gilli and Garbely, 1996) that the rows and columns of a structurally
singular matrix can be permuted so as to bring out a submatrix of zeros of dimension v × w with
v + w = 2n − k. To augment the structural rank, nonzero elements must then be introduced into
this submatrix. Below is an example of a matrix of order 8 with structural rank 6. Fig. 3.4 exhibits
the original matrix and the permutation showing the 6 × 4 zero submatrix.


FIGURE 3.4 Example of a matrix of order 8 with structural rank 6 and the corresponding 6 x 4 zero submatrix (right 
panel). 


A = sparse(8,8);
A(1,1:5) = 1;      A(2,[3 6]) = 1;
A(3,[3 7]) = 1;    A(4,[1 2 4 6 7 8]) = 1;
A(5,[5 6]) = 1;    A(6,[6 7]) = 1;
A(7,[3 7]) = 1;    A(8,[3 5]) = 1;
[p,q,r,s] = dmperm(A)
p =
     1     4     2     5     6     3     7     8
q =
     4     8     1     2     3     5     6     7
r =
     1     3     9
s =
     1     5     9
spy(A(p,q))


3.4 The Least Squares problem 


Fitting a model to observations is among the most fundamental problems in empirical modeling. We
will discuss the case in which the number of observations is greater than the number of parameters.
According to the choice of the model, this will involve the solution of a linear system.

The "best" solution can be determined in several ways. Let us recall the usual notation in
statistics (and econometrics), where y are the observations of the dependent variable, X the
independent variables, and $f(X, \beta)$ the model with $\beta$ the vector of parameters. The
solution chosen then corresponds to the minimization problem

$$ \min_{\beta} \| f(X, \beta) - y \|_2^2 . $$

In numerical methods, a different notation is used. The problem is written as Ax = b, where b are
the observations,¹³ A the independent variables, and x is the vector of parameters. Such a system
with $A \in \mathbb{R}^{m \times n}$, m > n is called an overidentified system. Generally, such a
system has no exact solution, and we define the vector of residuals as

$$ r = Ax - b . $$


13. Another common notation is Ax ≈ b that serves as a reminder that there is, in general, no equality between the left- and
right-hand vectors.
The Least Squares solution is then obtained by solving the following minimization problem

$$ \min_{x} \; g(x) = \frac{1}{2}\, r(x)' r(x) = \frac{1}{2} \sum_{i=1}^{m} r_i(x)^2 , \tag{3.5} $$

with r(x) the vector of residuals as a function of the parameters x.

The particular choice of minimizing the Euclidean norm of the residuals is motivated by two
reasons: i) in statistics such a solution corresponds, in the case of the classical linear model
with i.i.d. residuals, to the best linear unbiased estimator (BLUE); ii) the solution can be
obtained with relatively simple and efficient numerical procedures.

The Least Squares approach is attributed to Gauss who in 1795 (at the age of 18) discovered
the method. The development of modern numerical procedures for the solution of Least Squares
took place in the 1960s with the introduction of the QR and SVD decompositions and more
recently with the development of methods for large and sparse linear systems.

Often, and particularly for problems arising in statistics and econometrics, we are not only
interested in the solution x of the Least Squares problem Ax = b but we also want to compute the
variance-covariance matrix $\sigma^2 (A'A)^{-1}$. Therefore, we need to evaluate $\|b - Ax\|_2^2$
at the solution as this quantity appears in the evaluation of
$\sigma^2 = \|b - Ax\|_2^2 / (m - n)$, and we also need an efficient and numerically stable method
to compute $(A'A)^{-1}$ or a subset of elements of this matrix.

In the following sections, we will explain how to perform this for the different chosen
approaches.


3.4.1 Method of normal equations 


The solution for the problem defined in Eq. (3.5) can be obtained in different ways. One way is to
take the derivative of Eq. (3.5) with respect to x and write the first-order condition for a minimum,
that is,

$$ 2A'Ax - 2A'b = 0 , $$

which corresponds to the linear system

$$ A'Ax = A'b , $$

called the system of normal equations.

The system of normal equations can also be derived by geometrical considerations. The vector Ax
closest to b is given by the orthogonal projection of b onto the space generated by the columns
of A.

The condition for the projection to be orthogonal is

$$ A'r = A'(b - Ax) = 0 $$

from where we again obtain the system of normal equations A'Ax = A'b.


If the matrix A is of full rank, A'A is positive-definite and the system of normal equations can
be solved by resorting to the Cholesky factorization (see Algorithm 11).
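
The link between the normal equations and the orthogonality condition can be checked numerically.
The following sketch solves A'Ax = A'b with a Cholesky factor and then verifies that the residuals
are orthogonal to the columns of A. It uses the small data set of the example given later in this
section; the variable names are assumptions.

```matlab
% Sketch: normal equations solved with a Cholesky factor, then an orthogonality check.
A = [1 1; 1 2; 1 3; 1 4];  b = [2 1 1 1]';   % data of the example used later in this section
G = chol(A'*A)';                             % lower triangular factor, A'A = G*G'
x = G' \ (G \ (A'*b));                       % forward, then back-substitution
r = b - A*x;                                 % residuals at the solution
fprintf('x = [%g %g], norm(A''*r) = %g\n', x, norm(A'*r));
```

Up to rounding error, A'r vanishes, which is the first-order condition restated.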
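Computation of $\|r\|_2^2$


If the sum of squared residuals $\|r\|_2^2$ is needed, we form the bordered matrix

$$ \bar{A} = [\, A \;\; b \,] $$

and consider the Cholesky factorization of the matrix $\bar{C} = \bar{A}'\bar{A}$,

$$ \bar{C} = \begin{bmatrix} C & c \\ c' & b'b \end{bmatrix} = \bar{G}\bar{G}' \qquad\text{with}\qquad \bar{G} = \begin{bmatrix} G & 0 \\ z' & \rho \end{bmatrix} . $$

The solution x and the norm of the residuals are then computed as¹⁴

$$ G'x = z \qquad\text{and}\qquad \|Ax - b\|_2 = \rho . $$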


Computation of $(A'A)^{-1}$

The matrix $S = (A'A)^{-1}$ can be computed avoiding the direct inversion of A'A by considering that

$$ S = (A'A)^{-1} = (GG')^{-1} = G'^{-1} G^{-1} , $$

which necessitates only the computation of the inverse of a triangular matrix $T = G^{-1}$ and the
computation of the lower triangle of the product T'T. The inverse T of a lower triangular matrix G
is computed as

$$ t_{ij} = \begin{cases} 1/g_{ii} & i = j \\[4pt] -\dfrac{1}{g_{ii}} \displaystyle\sum_{k=j}^{i-1} g_{ik} t_{kj} & i \neq j \end{cases} $$

and needs $n^3/3 + 3/2\, n^2 - 5/6\, n$ elementary operations.¹⁵


If only the diagonal elements of S were needed, it would be sufficient to compute the square of
the Euclidean norm of the columns of T, that is,

$$ s_{ii} = \sum_{j=i}^{n} t_{ji}^2 , \qquad i = 1, 2, \ldots, n. $$


Algorithm 11 summarizes the solution of the Least Squares problem with normal equations.


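14. To demonstrate this result, we develop the expressions for $\bar{A}'\bar{A}$ and $\bar{G}\bar{G}'$

$$ \begin{bmatrix} C & c \\ c' & b'b \end{bmatrix} = \begin{bmatrix} G & 0 \\ z' & \rho \end{bmatrix} \begin{bmatrix} G' & z \\ 0 & \rho \end{bmatrix} = \begin{bmatrix} GG' & Gz \\ z'G' & z'z + \rho^2 \end{bmatrix} $$

where we recognize that G corresponds to the Cholesky matrix of A'A and

$$ Gz = c \qquad\text{and}\qquad b'b = z'z + \rho^2 . $$

As r = b − Ax is orthogonal to Ax, we have

$$ \|Ax\|_2^2 = (r + Ax)'Ax = b'Ax = \underbrace{b'A\,G'^{-1}}_{z'}\, z = z'z $$

from which we derive

$$ \|Ax - b\|_2^2 = \underbrace{x'A'Ax}_{z'z} - 2\,\underbrace{b'Ax}_{z'z} + b'b = b'b - z'z = \rho^2 . $$

15. It is also possible to compute matrix S without inverting matrix G (cf. Björck, 1996, p. 119, and Lawson and Hanson,
1974, pp. 67-73).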
Algorithm 11 Least Squares with normal equations.

1: compute $C = A'A$    # (only lower triangle needed)
2: form $\bar{C} = \begin{bmatrix} C & A'b \\ b'A & b'b \end{bmatrix}$
3: compute $\bar{G} = \begin{bmatrix} G & 0 \\ z' & \rho \end{bmatrix}$    # (Cholesky matrix)
4: solve $G'x = z$    # (upper triangular system)
5: $\sigma^2 = \rho^2 / (m - n)$
6: compute $T = G^{-1}$    # (inversion of triangular matrix)
7: $S = \sigma^2\, T'T$    # (variance-covariance matrix)


The method of normal equations is appealing as it relies on simple procedures such as the
Cholesky factorization, matrix multiplications and forward and back-substitutions. Moreover, the
initial matrix A of dimensions m × n is compressed into an n × n matrix which might constitute an
advantage in situations where m ≫ n.

Below, a small example with the corresponding results and the MATLAB code to solve it with
the normal equations approach is given.


Listing 3.6: C-LinEqsLSP/M/./Ch03/ExNormalEquations.m 


% ExNormalEquations.m -- version 2003-11-12
A = [1 1; 1 2; 1 3; 1 4]; b = [2 1 1 1]';
[m,n] = size(A);
C = A'*A; c = A'*b; bb = b'*b;
Cbar = [C c; c' bb];
Gbar = chol(Cbar)';
G = Gbar(1:2,1:2); z = Gbar(3,1:2)'; rho = Gbar(3,3);
x = G'\z
sigma2 = rho^2/(m-n)
T = trilinv(G); % Inversion of triangular matrix
S = T'*T;
Mcov = sigma2*S


x =
    2.0000
   -0.3000
sigma2 =
    0.1500
Mcov =
    0.2250   -0.0750
   -0.0750    0.0300


The following MATLAB code overwrites a lower triangular matrix by its inverse.


Listing 3.7: C-LinEqsLSP/M//Ch03/trilinv.m


function L = trilinv(L)
% trilinv.m -- version 1995-03-25
for i = 1:length(L(:,1))
    for j = 1:i-1
        L(i,j) = -L(i,j:i-1)*L(j:i-1,j) / L(i,i);
    end
    L(i,i) = 1/L(i,i);
end


The diagonal, or single elements of the variance-covariance matrix can be computed with the
code


S(i,j) = T(i:n,i)' * T(i:n,j) * sigma2;


3.4.2 Least Squares via QR factorization 


An orthogonal transformation of a vector does not modify its length. This property is exploited
when solving the Least Squares problem, which requires minimizing the length of the residual
vector. For the discussion that follows it is convenient to partition the matrices Q and R of
the QR decomposition as

$$ A = QR = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \begin{bmatrix} R_1 \\ 0 \end{bmatrix} , $$

with $Q_1$ of dimension m × n, $Q_2$ of dimension m × (m − n), and $R_1$ upper triangular of
dimension n × n.

The length of the vector of residuals r to be minimized is

$$ \| \underbrace{Ax - b}_{r} \|_2^2 . $$

We replace matrix A by its factorization QR and apply to the vector of residuals r an orthogonal
transformation by multiplying it by Q', which gives $\| Q'QRx - Q'b \|_2^2$, and as mentioned
earlier, the length of r is not affected. Considering the partition of the matrices Q and R,

$$ \left\| \begin{bmatrix} R_1 \\ 0 \end{bmatrix} x - \begin{bmatrix} Q_1' \\ Q_2' \end{bmatrix} b \right\|_2^2 , $$

and rewriting the expression, we get

$$ \| R_1 x - Q_1'b \|_2^2 + \| Q_2'b \|_2^2 . $$

Solving the upper triangular system $R_1 x = Q_1'b$, we obtain the solution x, and the sum of
squared residuals g(x) corresponds to $\| Q_2'b \|_2^2$.

In practice, we do not store matrix Q, but the columns are computed bit by bit as they are
needed. Note that due to the orthogonality of Q, we have A'A = R'Q'QR = R'R, and, therefore,
for the computation of $(A'A)^{-1}$, we can proceed as we have done previously for the Cholesky
factorization.
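
When only $Q_1$ and $R_1$ are wanted, the economy-size factorization qr(A,0) returns them directly.
The sketch below uses it on the small data set of this section and obtains the sum of squared
residuals without $Q_2$, using the fact that $\|Q'b\|_2 = \|b\|_2$. Computing the sum of squares as
a difference is an assumption made for brevity; it can lose accuracy when the residuals are tiny
relative to b.

```matlab
% Sketch: Least Squares via the economy-size QR, which avoids forming Q2.
A = [1 1; 1 2; 1 3; 1 4];  b = [2 1 1 1]';   % data of the example in this section
[m,n] = size(A);
[Q1,R1] = qr(A,0);                           % Q1 is m x n, R1 is n x n upper triangular
c1  = Q1'*b;
x   = R1 \ c1;                               % back-substitution
ssr = b'*b - c1'*c1;                         % equals ||Q2'*b||^2 because ||Q'*b|| = ||b||
sigma2 = ssr/(m-n);
fprintf('x = [%g %g], sigma2 = %g\n', x, sigma2);
```

For m much larger than n, this avoids allocating the m × m matrix Q that the full factorization
would return.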

We reconsider the same example we used for the normal equations approach and illustrate the 
solution via the QR factorization. 


Listing 3.8: C-LinEqsLSP/M/./Ch03/ExLSQR.m 


% ExLSQR.m -- version 1995-11-25
A = [1 1; 1 2; 1 3; 1 4]; b = [2 1 1 1]';
[m,n] = size(A);
[Q,R] = qr(A);
R1 = R(1:n,1:n); Q1 = Q(:,1:n); Q2 = Q(:,n+1:m);
x = R1\(Q1'*b)
r = Q2'*b;
sigma2 = r'*r/(m-n)
T = trilinv(R1');
S = T'*T;
Mcov = sigma2*S


x =
    2.0000
   -0.3000
sigma2 =
    0.1500
Mcov =
    0.2250   -0.0750
   -0.0750    0.0300


3.4.3 Least Squares via SVD decomposition 


As is the case for the QR decomposition, we exploit the property that the length of a vector is
invariant to orthogonal transformations. The SVD approach can handle situations in which the
columns of matrix $A \in \mathbb{R}^{m \times n}$ do not have full rank, that is, the rank r of
matrix A satisfies r < p = min(m, n). We have

$$ x = \underbrace{V \begin{bmatrix} \Sigma_r^{-1} & 0 \\ 0 & 0 \end{bmatrix} U'}_{A^{+}}\, b \qquad\text{or}\qquad x = \sum_{j=1}^{r} \frac{u_j'b}{\sigma_j}\, v_j , \tag{3.6} $$

where the matrices V, Σ, and U are matrices of the singular value decomposition A = UΣV'.
Matrix $A^{+}$ is the so-called pseudoinverse. Denote

$$ z = V'x = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} \qquad\text{and}\qquad c = U'b = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} , $$

with $z_1$ and $c_1 \in \mathbb{R}^{r}$. We now consider the vector of residuals (and its Euclidean
norm) and apply an orthogonal transformation to the vector of residuals

$$ \| b - Ax \|_2 = \| U'(b - AVV'x) \|_2 = \left\| \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} - \begin{bmatrix} \Sigma_r & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} \right\|_2 = \left\| \begin{bmatrix} c_1 - \Sigma_r z_1 \\ c_2 \end{bmatrix} \right\|_2 . $$

The vector x as defined in Eq. (3.6) minimizes the sum of squared residuals given that
$c_1 - \Sigma_r z_1 = 0$ and therefore SSR $= \sum_{i=r+1}^{m} (u_i'b)^2$. Note that Eq. (3.6)
produces a solution even in the case when the columns of A do not have full rank.

From the singular value decomposition of matrix A, we derive that
$(A'A)^{-1} = V \Sigma_r^{-2} V'$, where $\Sigma_r$ is the square submatrix of Σ. Then, we are able
to express particular elements of matrix $S = (A'A)^{-1}$ as

$$ s_{ij} = \sum_{k=1}^{n} \frac{v_{ik} v_{jk}}{\sigma_k^2} . $$

Again we consider the example already used for the normal equation approach and the QR 
factorization to illustrate the solution via the SVD decomposition. 


Listing 3.9: C-LinEqsLSP/M/./Ch03/ExLSsvd.m 


% ExLSsvd.m -- version 1995-11-27
A = [1 1; 1 2; 1 3; 1 4]; b = [2 1 1 1]';
[m,n] = size(A);
[U,S,V] = svd(A);
c = U'*b; c1 = c(1:n); c2 = c(n+1:m);
sv = diag(S);
z1 = c1./sv;
x = V*z1
sigma2 = c2'*c2/(m-n)
S = V*diag(sv.^(-2))*V';
Mcov = sigma2*S


x =
    2.0000
   -0.3000
sigma2 =
    0.1500
Mcov =
    0.2250   -0.0750
   -0.0750    0.0300


3.4.4 Final remarks 


The solution via SVD decomposition is the most expensive but also the most accurate numerical
procedure to solve the Least Squares problem: an accuracy that is seldom needed in problems
related to empirical finance. In modern software, the method of choice for the solution of the
Least Squares problem is via QR factorization. However, if m ≫ n, the normal equations approach
requires about half the operations and memory space. In MATLAB, the backslash operator solves
the problem via QR.


The backslash operator in MATLAB

Algorithm 12 illustrates how \ works with full matrices:


Statement 1 tests whether A is square, Statement 3 verifies if A is lower triangular and
Statement 6 tests for the upper triangular case. If the condition in Statement 10 is true, A is
symmetric and Statement 13 checks whether A is positive-definite. Finally, in Statement 22, a
rectangular matrix A is identified and a Least Squares problem is solved via QR factorization (see
Section 3.4.2).


As mentioned previously, one should always avoid computing the inverse of a matrix. This
is approximately twice as expensive as the methods presented to solve linear systems. The
solution X of the set of linear systems AX = I corresponds to $A^{-1}$.

If an inverse appears in an expression, generally, it hides the solution of a linear system.
For instance, in the bilinear form $s = q'A^{-1}r$, $A^{-1}r$ is the solution of Ax = r, and,
therefore, we can code


s = q' * (A\r)
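
A short comparison shows that both ways of coding the bilinear form give the same number. The
matrix and vectors are illustrative assumptions; the explicit inverse appears only as the
reference that the text advises against.

```matlab
% Sketch: a bilinear form q'*inv(A)*r evaluated without forming the inverse.
A = [4 1 0; 1 5 2; 0 2 6];  q = [1; 0; -1];  r = [2; 1; 3];   % illustrative data
s1 = q' * (A \ r);           % solves one linear system
s2 = q' * inv(A) * r;        % forms the inverse explicitly; shown for comparison only
fprintf('difference: %g\n', abs(s1 - s2));
```

The parentheses around A\r matter: they ensure that a single linear system with right-hand side r
is solved before the inner product with q is taken.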


Appendix 3.A Solving linear systems in R 


In this appendix, we briefly summarize how linear systems can be solved in R. We limit the discus- 
sion to functions that are provided in base R; we shall not, for instance, discuss packages such as 
Matrix. 
Algorithm 12 Pseudocode for how \ works in MATLAB.

 1: if size(A,1) == size(A,2) then
 2:     % A is square
 3:     if isequal(A,tril(A)) then
 4:         % A is lower triangular
 5:         x = A\b;                  # forward substitution on b
 6:     else if isequal(A,triu(A)) then
 7:         % A is upper triangular
 8:         x = A\b;                  # back-substitution on b
 9:     else
10:         if isequal(A,A') then
11:             % A is symmetric
12:             [R,p] = chol(A);
13:             if (p == 0) then
14:                 % A is symmetric positive-definite
15:                 x = R\(R'\b);     # forward and back-substitution
16:                 return
17:             end if
18:         end if
19:         [L,U,P] = lu(A);          # A is a general square matrix
20:         x = U\(L\(P*b));          # forward and back-substitution
21:     end if
22: else
23:     % A is rectangular
24:     [Q,R] = qr(A);
25:     x = R\(Q'*b);                 # back-substitution on Q'b
26: end if
solve 


Linear systems Ax = b can be solved in R with solve. For triangular systems, you can use
forwardsolve and backsolve. To demonstrate the use of solve, we first create a small,
5 × 5 system of equations.


> n <- 5
> x1 <- numeric(n) + 1
> x2 <- numeric(n) + 2
> A <- array(sample(1:50, n*n, replace = TRUE),
+            dim = c(n, n))
> A

     [,1] [,2] [,3] [,4] [,5]
[1,]   20   12    2   27   39
[2,]    6   26    8   30   16
[3,]   15    8   27    1   35
[4,]   29   43    2   44   41
[5,]   33    2   36   13   24

> b1 <- A %*% x1
> b2 <- A %*% x2


Calling solve with arguments A and b1 should recover x1.


> solve(A, b1)

     [,1]
[1,]    1
[2,]    1
[3,]    1
[4,]    1
[5,]    1

> solve(A, b2)

     [,1]
[1,]    2
[2,]    2
[3,]    2
[4,]    2
[5,]    2


The function may solve several linear systems at once.


> solve(A, cbind(b1, b2))

     [,1] [,2]
[1,]    1    2
[2,]    1    2
[3,]    1    2
[4,]    1    2
[5,]    1    2

Least Squares 


We wish to solve Xβ ≈ y.


> nr <- 100
> np <- 3
> X <- array(rnorm(nr*np), dim = c(nr, np))
> y <- rnorm(nr)


If the computation is done in the context of linear regression, the function to use is lm.


> lm(y ~ X)


Call:
lm(formula = y ~ X)


Coefficients:
(Intercept)           X1           X2           X3
     0.1186       0.0540      -0.2143      -0.0381


lm does much more than just compute the coefficients: it first constructs a model matrix (e.g. by
adding a constant); and it computes residuals, standard errors and other quantities. If we only want
the raw solution, it may thus be too slow. There are several alternatives.

One is to directly use the QR decomposition (note that now we need to add a constant column).


> qr.solve(cbind(1, X), y)


[1]  0.11859  0.05395 -0.21425 -0.03812
Better yet, there is actually a bare-bones version of lm.


> str(.lm.fit(cbind(1, X), y))
List of 9
 $ qr          : num [1:100, 1:4] -10 0.1 0.1 0.1 0.1 ..
 $ coefficients: num [1:4] 0.1186 0.054 -0.2143 -0.0381
 $ residuals   : num [1:100] 1.247 1.034 0.401 0.346 ..
 $ effects     : num [1:100] -0.764 0.28 2.148 -0.35 ..
 $ rank        : int 4
 $ pivot       : int [1:4] 1 2 3 4
 $ qraux       : num [1:4] 1.1 1.21 1.13 1.04
 $ tol         : num 1e-07
 $ pivoted     : logi FALSE


> str(lm.fit(cbind(1, X), y))
List of 8
 $ coefficients : Named num [1:4] 0.1186 0.054 -0.214 ..
  ..- attr(*, "names")= chr [1:4] "x1" "x2" "x3" "x4"
 $ residuals    : num [1:100] 1.247 1.034 0.401 0.346 ..
 $ effects      : Named num [1:100] -0.764 0.28 2.148 ..
  ..- attr(*, "names")= chr [1:100] "x1" "x2" "x3" ..
 $ rank         : int 4
 $ fitted.values: num [1:100] 0.429 -0.161 -0.144 0.0..
 $ assign       : NULL
 $ qr           :List of 5
  ..$ qr   : num [1:100, 1:4] -10 0.1 0.1 0.1 0.1 0.1 ..
  ..$ qraux: num [1:4] 1.1 1.21 1.13 1.04
  ..$ pivot: int [1:4] 1 2 3 4
  ..$ tol  : num 1e-07
  ..$ rank : int 4
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 96


> library("rbenchmark")
> benchmark(lm(y ~ X),
+           .lm.fit(cbind(1, X), y),
+           lm.fit(cbind(1, X), y),
+           qr.solve(cbind(1, X), y),
+           order = "relative",
+           replications = 1000)[, 1:4]

                      test replications elapsed relative
2  .lm.fit(cbind(1, X), y)         1000   0.008    1.000
3   lm.fit(cbind(1, X), y)         1000   0.033    4.125
4 qr.solve(cbind(1, X), y)         1000   0.048    6.000
1                lm(y ~ X)         1000   0.455   56.875
This chapter is intended as a simple introduction to finite difference methods for financial
applications. For a comprehensive introduction to these methods, we suggest¹ Farlow (1982) or Smith
(1985).


4.1 An example of a numerical solution 


Let us consider the following differential equation

$$\frac{\partial f}{\partial x} = -\frac{f(x)}{x}, \tag{4.1}$$

which is a so-called ordinary differential equation as f is a function of a single variable. Solving a
differential equation means finding a function f(x) that satisfies Eq. (4.1). For this simple case, we
easily find that the general solution is

$$f(x) = \frac{C}{x}. \tag{4.2}$$

Indeed, the derivative of Eq. (4.2) is $\partial f/\partial x = -C/x^2$, and replacing f(x) in Eq. (4.1), we obtain the
same expression for the derivative,

$$-\frac{C/x}{x} = -\frac{C}{x^2}.$$


The solution (4.2) contains an arbitrary constant C that disappears when we take its first
derivative. This means that there exists an infinite choice for C satisfying Eq. (4.1). In order to find a
particular solution we have to impose some constraints on Eq. (4.2), as, for instance, that a given
point has to belong to it. These are called initial conditions. For example, consider the initial
condition f(2) = 1, generally written as $x_0 = 2$ and $y_0 = 1$, from which we derive that 1 = C/2 and
C = 2. In Fig. 4.1 you find the graph of the particular solution f(x) = 2/x.

Three approaches can be considered to get the solution of a differential equation:


• Search for an analytical solution; difficult in most situations and, worse, an analytical solution
may often not exist.

• Search for an approximation of the analytical solution. Generally, this involves great effort for
finally having only an approximation, the precision of which may be hard to evaluate.

• Resort to a numerical approximation, which allows us to easily tackle any differential equation.
A numerical method approximates the function satisfying the differential equation in a given
finite number of points. Thus, while the analytical solution is a formula, a numerical solution is
a table of approximations for given values of the independent variable.


FIGURE 4.1 Graph of function f(x) = 2/x.


1. Presentations of finite difference methods in finance books often lack simplicity.


A first numerical approximation 


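Let us use a numerical method to find the solution for the previous example. By replacing the
derivative with a forward approximation, the differential equation $\partial f/\partial x = -f(x)/x$ becomes

$$\frac{f(x + \Delta x) - f(x)}{\Delta x} = -\frac{f(x)}{x},$$

from which we get

$$f(x + \Delta x) = f(x)\left(1 - \frac{\Delta x}{x}\right).$$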


Starting from $x_0 = 2$, where we have $f(x_0) = 1$, we can determine the value of the function at
$x_0 + \Delta x$ numerically. This defines the so-called explicit method.² The MATLAB® code in Listing
4.1 implements this method. In the code, function values $f(x_i)$ are denoted by $y_i$ and a constant
has been added to the index. The reason for this is that in MATLAB, as in some other languages
(but not in C), the first element of an array is indexed with 1. In order to enhance the reading of the
code, it would be preferable to have $x_0$ correspond to the address 0. This can be achieved with the
following construction:


• define f7 = 1 (offset)
• shift indexing by f7
• x(f7+i), i = 0, 1, ..., N

i          0      1      ...    N
address    1      2      ...    N+1
x          x_0    x_1    ...    x_N


Listing 4.1: C-FiniteDifferences/M/./Ch04/ODE1a.m


% ODE1a.m -- version 1997-12-09
f7 = 1;                          % index offset: x(f7+0) holds x_0
x0 = 2; y0 = 1;                  % initial condition f(x0) = y0
xN = 10; N = 10;
dx = (xN - x0)/N;
x = linspace(x0,xN,N+1); y = zeros(1,N+1);
y(f7+0) = y0;
for i = 1:N
    y(f7+i) = y(f7+i-1) * ( 1 - dx/x(f7+i-1) );
end


2. More generally, a method is called explicit whenever we compute the function value at a grid node directly from other 
nodes whose values are known. 
FIGURE 4.2 Numerical approximation with the explicit method (circles) of the differential equation defined in Eq. (4.1) 
for N = 10 (left panel) and N = 30 (right panel). 


Fig. 4.2 plots the numerical approximation of the solution of the differential equation given in
Eq. (4.1) computed with the code ODE1a.m. Note that the solution is computed only at the N grid
points beyond $x_0$. Also, we observe that the precision improves as N increases.
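The explicit recursion can also be sketched outside MATLAB. The following Python version is only an illustration (the function names `explicit_ode` and `max_error` are ours, not part of the book's code); it applies the same forward-difference update and compares the result with the analytical solution f(x) = 2/x for several values of N.

```python
def explicit_ode(x0, y0, xN, N):
    """Forward-difference (explicit) solution of f'(x) = -f(x)/x on [x0, xN]."""
    dx = (xN - x0) / N
    x = [x0 + i * dx for i in range(N + 1)]
    y = [0.0] * (N + 1)
    y[0] = y0                                    # initial condition f(x0) = y0
    for i in range(1, N + 1):
        y[i] = y[i - 1] * (1.0 - dx / x[i - 1])  # f(x+dx) = f(x) (1 - dx/x)
    return x, y


def max_error(N):
    """Largest absolute deviation from the exact solution f(x) = 2/x."""
    x, y = explicit_ode(2.0, 1.0, 10.0, N)
    return max(abs(yi - 2.0 / xi) for xi, yi in zip(x, y))


for N in (10, 30, 100):
    print(N, max_error(N))
```

Python lists start at index 0, so no offset such as f7 is needed here.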


A second numerical approximation 


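We may also approximate the derivative in the differential equation given in Eq. (4.1) by a central
difference

$$\frac{f(x + \Delta x) - f(x - \Delta x)}{2\,\Delta x} = -\frac{f(x)}{x}.$$

Isolating f(x) we get

$$f(x) = \frac{x}{2\,\Delta x}\,\bigl(f(x - \Delta x) - f(x + \Delta x)\bigr).$$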


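To solve this expression, we need to know two points of the solution, that is, $f(x - \Delta x)$ and
$f(x + \Delta x)$, thus introducing an additional initial condition. Let us write the approximation for
four successive points $(x_1, y_1)$, $(x_2, y_2)$, $(x_3, y_3)$, and $(x_4, y_4)$:

$$\begin{aligned}
y_1 &= x_1 (y_0 - y_2)/(2\,\Delta x) \\
y_2 &= x_2 (y_1 - y_3)/(2\,\Delta x) \\
y_3 &= x_3 (y_2 - y_4)/(2\,\Delta x) \\
y_4 &= x_4 (y_3 - y_5)/(2\,\Delta x)
\end{aligned}$$

Note that this is a linear system. In matrix form, we have

$$\begin{bmatrix}
1 & c_1 & & \\
-c_2 & 1 & c_2 & \\
 & -c_3 & 1 & c_3 \\
 & & -c_4 & 1
\end{bmatrix}
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix}
=
\begin{bmatrix} c_1 y_0 \\ 0 \\ 0 \\ -c_4 y_5 \end{bmatrix}$$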


with $c_i = x_i/(2\,\Delta x)$, and we can solve the system if $y_0$ and $y_5$ are known. This method is called
the implicit method.³ In this particular case, the initial condition is easy to determine; for the final
condition, we choose zero, as we already know that the function tends to zero. If we are “far
enough to the right,” the precision of such an approximation is acceptable. The MATLAB code that
implements the implicit method in which the derivative is approximated with a central difference is
given with the code ODE1b.m.⁴ The MATLAB function spdiags, used in the code, is explained
in more detail on page 86.


3. More generally, an implicit method is one in which a node is computed as the solution of a system of equations. The 
explicit and implicit methods are also called explicit and implicit Euler methods. 
4. An alternative method to solve the tridiagonal linear system of the example is given on page 46. 
Listing 4.2: C-FiniteDifferences/M/./Ch04/ODE1b.m


% ODE1b.m -- version 1997-12-09
f7 = 1;
x0 = 2;  y0 = 1;   % Initial condition
xN = 30; yN = 0;   % Final condition
N = 30;            % Number of points
dx = (xN - x0)/N;
x = linspace(x0,xN,N+1);
y(f7+0) = y0;
x(f7+0) = x0;
for i = 1:N
    c(i) = x(f7+i)/(2*dx);
end
A = spdiags([[-c(2:N) NaN]' ones(N,1) [NaN c(1:N-1)]'],-1:1,N,N);
b = [c(1)*y0 zeros(1,N-2) -c(N)*yN]';
y(f7+(1:N)) = A\b;


Fig. 4.3 shows approximations for different final conditions and number of steps. From the two 
examples, it clearly appears that the choice of the numerical method, as well as the step length, 
and final conditions for the implicit method, determine the quality of the approximation and the 
computational complexity. In particular for the implicit method, the smaller the step size, the larger 
the linear system we have to solve. 


FIGURE 4.3 Numerical approximation with the implicit method (circles) of the differential equation defined in Eq. (4.1)
for $x_N = 30$ and N = 30 (left panel) and $x_N = 100$ and N = 80 (right panel).


4.2 Classification of differential equations 


Let u(x, y) be a function of x and y, and let the partial derivatives of u be indicated by subscripts,
that is, $u_x = \partial u(x,y)/\partial x$ and $u_{yy} = \partial^2 u(x,y)/\partial y^2$. We distinguish two classes of differential equations:


• Ordinary differential equations (ODE), characterized by a function with a single independent
variable
• Partial differential equations (PDE). In this case, the function has several independent variables


Partial differential equations are also classified according to the following criteria (e.g., for a
function u(x, y)):


• Highest degree of derivative
• Linear function of derivatives with constant coefficients, for example,
$$u_{xx} + 3\,u_{xy} + u_{yy} + u_x - u = e^{x}$$
• Linear function of derivatives with variable coefficients, for example,
$$\sin(xy)\,u_{xx} + 3x^2\,u_{xy} + u_{yy} + u_x - u = 0$$
• Nonlinear function of derivatives, for example,
$$u_{xx} + 3\,u_{xy} + u_{yy} + u_x^2 - u = e^{x}$$
A particular class encapsulates linear two-dimensional partial differential equations of degree
two, that is, of the form

$$a\,u_{xx} + b\,u_{xy} + c\,u_{yy} + d\,u_x + e\,u_y + f\,u + g = 0,$$

which again, according to the value of the discriminant, is partitioned into

$$b^2 - 4ac \;\begin{cases} > 0 & \text{hyperbolic equation} \\ = 0 & \text{parabolic equation} \\ < 0 & \text{elliptic equation.} \end{cases}$$


Note that with variable coefficients we can shift from one situation to another. Examples of linear
two-dimensional PDEs of degree two are:


• Hyperbolic PDEs for time processes, for example, wave motion,
$$u_{xx} - u_{yy} = 0.$$

• Parabolic PDEs for propagation, for example, heat propagation,
$$u_y - u_{xx} = 0.$$

• Elliptic PDEs for modeling systems in equilibrium, for example, Laplace equation,
$$u_{xx} + u_{yy} = 0.$$
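The discriminant rule is easy to check mechanically. The short Python sketch below (the function name `classify_pde` is ours, introduced only for illustration) takes the coefficients a, b, and c of the second-order terms and returns the class; it is applied to the wave, heat, and Laplace examples.

```python
def classify_pde(a, b, c):
    """Classify a*u_xx + b*u_xy + c*u_yy + ... = 0 by the sign of b^2 - 4ac."""
    disc = b * b - 4.0 * a * c
    if disc > 0:
        return "hyperbolic"
    if disc < 0:
        return "elliptic"
    return "parabolic"


print(classify_pde(1, 0, -1))   # wave motion:      u_xx - u_yy = 0
print(classify_pde(-1, 0, 0))   # heat propagation: u_y - u_xx = 0
print(classify_pde(1, 0, 1))    # Laplace equation: u_xx + u_yy = 0
```

With variable coefficients, the function would have to be evaluated point by point, since the class may change across the domain.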


There are three components when working with a PDE:

• The PDE

• The spatial and temporal domains over which the PDE has to be satisfied

• The initial conditions (at time 0) and the conditions at the boundary of the domain


4.3 The Black-Scholes equation 


The Black-Scholes equation is a basic model used to price European style options. The equation
reads

$$V_t + \tfrac{1}{2}\sigma^2 S^2\, V_{SS} + (r - q)\, S\, V_S - r V = 0,$$

where V is the option price, S and σ are the price and the volatility of the underlying asset,
respectively, r is the annualized continuously compounded risk-free rate of return, and q is the annualized
continuously compounded yield of the underlying. $V_t$ is the derivative with respect to time. $V_S$ and
$V_{SS}$ are the first- and second-order derivatives with respect to the price of the underlying. Time is
measured in years.

The Black-Scholes equation is a PDE with the following characteristics: 


• Linear with variable coefficients
• First order in time
• Second order in space (that is, the stock price)
• Parabolic PDE (a = 0 and b = 0 in the discriminant)


One reason for the popularity of the Black-Scholes equation is the existence of an analytical
solution. For the call, this solution is

$$C(S,t) = S\,e^{-q(T-t)}\,N(d_1) - X\,e^{-r(T-t)}\,N(d_2),$$

where X is the exercise or strike price of the option and
$N(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-z^2/2}\,dz$ is the standard
normal cumulative distribution function; its arguments are

$$d_1 = \frac{\log(S/X) + (r - q + \sigma^2/2)(T-t)}{\sigma\sqrt{T-t}} \quad\text{and}\quad d_2 = d_1 - \sigma\sqrt{T-t}.$$
Substituting the expression for the call in the put-call parity, we get the solution for the put

$$\begin{aligned}
P(S,t) &= C(S,t) - S\,e^{-q(T-t)} + X\,e^{-r(T-t)} \\
&= S\,e^{-q(T-t)}\,N(d_1) - X\,e^{-r(T-t)}\,N(d_2) - S\,e^{-q(T-t)} + X\,e^{-r(T-t)} \\
&= X\,e^{-r(T-t)}\,\bigl(1 - N(d_2)\bigr) - S\,e^{-q(T-t)}\,\bigl(1 - N(d_1)\bigr) \\
&= X\,e^{-r(T-t)}\,N(-d_2) - S\,e^{-q(T-t)}\,N(-d_1).
\end{aligned}$$


The call and put formulas are implemented in the MATLAB code BScall.m and BSput.m.
The standard normal cumulative distribution is computed with the MATLAB function normcdf.
Note that element-by-element multiplications and divisions have been coded so that it is
possible to provide a particular input argument as an array, for which the function will then return the
corresponding array of prices.


Listing 4.3: C-FiniteDifferences/M/./Ch04/BScall.m


function C = BScall(S,X,r,q,T,sigma)
% BScall.m -- version 2007-04-05
% Pricing of European call with Black-Scholes
d1 = (log(S./X) + (r-q+sigma.^2 /2) .* T) ./ (sigma.*sqrt(T));
d2 = d1 - sigma .* sqrt(T);
C = S .* exp(-q .* T) .* normcdf(d1) - ...
    X .* exp(-r .* T) .* normcdf(d2);


Listing 4.4: C-FiniteDifferences/M/./Ch04/BSput.m


function P = BSput(S,X,r,q,T,sigma)
% BSput.m -- version 2004-04-23
% Pricing of European put with Black-Scholes
d1 = (log(S./X) + (r-q+sigma.^2 /2) .* T) ./ (sigma.*sqrt(T));
d2 = d1 - sigma .* sqrt(T);
P = X .* exp(-r .* T) .* normcdf(-d2) - ...
    S .* exp(-q .* T) .* normcdf(-d1);


S = 10; X = 8; r = 0.03; q = 0; sigma = 0.20; T = 2;
C = BScall(S,X,r,q,T,sigma);
P = BSput(S,X,r,q,T,sigma);
fprintf('\n Call = %6.4f Put = %6.4f\n',C,P)

 Call = 2.6687 Put = 0.2028
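For readers without MATLAB, the two formulas can be sketched in Python with the standard library only. This is an illustration, not the book's code: the names `norm_cdf`, `bs_call`, and `bs_put` are ours, the normal distribution function is obtained from `math.erf`, and the maturity T = 2 is assumed for the example above. The script also reports the put-call parity gap, which should be zero up to rounding.

```python
from math import erf, exp, log, sqrt


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def _d1_d2(S, X, r, q, T, sigma):
    d1 = (log(S / X) + (r - q + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    return d1, d1 - sigma * sqrt(T)


def bs_call(S, X, r, q, T, sigma):
    """Black-Scholes price of a European call; T is the time to maturity."""
    d1, d2 = _d1_d2(S, X, r, q, T, sigma)
    return S * exp(-q * T) * norm_cdf(d1) - X * exp(-r * T) * norm_cdf(d2)


def bs_put(S, X, r, q, T, sigma):
    """Black-Scholes price of a European put; T is the time to maturity."""
    d1, d2 = _d1_d2(S, X, r, q, T, sigma)
    return X * exp(-r * T) * norm_cdf(-d2) - S * exp(-q * T) * norm_cdf(-d1)


S, X, r, q, T, sigma = 10.0, 8.0, 0.03, 0.0, 2.0, 0.20   # T = 2 is assumed
C = bs_call(S, X, r, q, T, sigma)
P = bs_put(S, X, r, q, T, sigma)
parity_gap = C - P - (S * exp(-q * T) - X * exp(-r * T))
print(f"Call = {C:.4f}  Put = {P:.4f}  parity gap = {parity_gap:.1e}")
```

Unlike the MATLAB functions, this sketch works on scalars only.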


Fig. 4.4 plots the prices computed with the Black-Scholes equation for the call (left panel)
and the put (right panel) options as a function of the price S of the underlying asset and the time
T − t to maturity. For the remaining parameters we have chosen exercise price X = 20, interest rate
r = 10%, dividend yield q = 0, and volatility σ = 60%.


FIGURE 4.4 Price of call option (left panel) and price of put option (right panel) for X = 20, r = 0.10, q = 0 and σ = 0.60.
4.3.1 Explicit, implicit, and θ-methods


In the following, the option prices will be computed with a numerical method, and the analytical 
prices will serve as a benchmark. As already mentioned, in order to apply a numerical method, we 
need to define the domain, that is, the range of the variables t and S, for which we want to solve the 
PDE and specify the initial and boundary conditions. In Fig. 4.4, the initial or terminal conditions 
(see footnote 5), plotted with a solid line, correspond to the option price at expiration. These prices 
are known exactly. The boundary conditions, plotted in dash dot in Fig. 4.4, correspond to the prices 
of the option at the limits of the domain defined for S. The numerical method will then approximate 
the prices on the nodes of a grid which also has to be specified. 


4.3.2 Initial and boundary conditions and definition of the grid 


The initial and boundary conditions for a European call are

$$\begin{aligned}
V(S,T) &= \max(S - X,\, 0) \\
V(0,t) &= 0 \\
\lim_{S\to\infty} V(S,t) &= S,
\end{aligned}$$

while the initial and boundary conditions for a put are

$$\begin{aligned}
V(S,T) &= \max(X - S,\, 0) \\
V(0,t) &= X\,e^{-r(T-t)} \\
\lim_{S\to\infty} V(S,t) &= 0
\end{aligned}$$

with X as the exercise price.

The space domain is specified by the interval $[S_{\min}, S_{\max}]$ for the underlying and the number N
of grid points for which we want to compute the numerical approximation. For the time domain,
the elapsed time goes from 0 to T in M steps, but as in a binomial tree, we will walk backward
from T to 0.⁵ The length of the space steps is then $\Delta_S = (S_{\max} - S_{\min})/N$, and the length of the
time steps is $\Delta_t = T/M$. The numerical approximations of V(S, t) will be (only) computed at the
grid points at the following rows and columns

$$\begin{aligned}
S_i &= S_{\min} + i\,\Delta_S, & i &= 0, \ldots, N \\
t_j &= j\,\Delta_t, & j &= 0, \ldots, M.
\end{aligned}$$

The nodes in the grid are denoted $v_{ij}$, and we also have

$$v_{ij} = V(S_{\min} + i\,\Delta_S,\; j\,\Delta_t), \qquad i = 0, \ldots, N, \quad j = 0, \ldots, M,$$

the values of V(S, t) at the (N + 1) × (M + 1) grid points. The time grid could also have been
defined as $t_j = t_0 + j\,\Delta_t$. However, as we always have $t_0 = 0$, contrarily to $S_{\min}$ for which we can
also have $S_{\min} > 0$, we simply write $t_j = j\,\Delta_t$. Fig. 4.5 illustrates the grid.

Terminal conditions correspond to the nodes in dark in column M, and boundary conditions 
correspond to the gray nodes in row 0 and N. The other (unknown) nodes will be evaluated starting 
from right (the terminal conditions) to left. 


5. A typical parabolic PDE is a function of time and space. Such an equation is then solved forward, that is, given the 
spatial boundary conditions, we start from some initial condition at time t = 0 and step forward in time. The Black-Scholes 
equation is such a PDE; the space variable is the price of the underlying. The equation can be transformed in various ways; 
in fact, it is equivalent to the heat equation in physics; see, for instance, Wilmott et al. (1993). In this chapter, we will not 
transform the equation but leave it in its original form. We do not need to understand anything about PDEs to see that now 
we cannot solve the equation forward in time, simply because we have no initial conditions. We could now transform time to 
run backward (that is, switch from calendar time t to remaining lifetime T − t). But a conceptually much simpler approach
is to solve the equation backward. As in a binomial tree, we start from the terminal condition (the payoff at T), and then 
step backward through the grid, from time T to time 0. 
FIGURE 4.5 Finite difference grid with terminal conditions (dark circles) and boundary conditions (gray circles).


Considering only function evaluations at the grid points $v_{ij}$, i = 0, ..., N, j = 0, ..., M, we
approximate the partial derivatives with respect to time and space in the Black-Scholes equation.
The derivative $V_t$ of the option with respect to time is approximated with a backward difference, and
the derivative of the option with respect to the stock price is approximated with a central difference

$$V_t\big|_{t = j\Delta_t} \approx \frac{v_{ij} - v_{i,j-1}}{\Delta_t}, \qquad
V_S\big|_{S = S_{\min} + i\Delta_S} \approx \frac{v_{i+1,j} - v_{i-1,j}}{2\,\Delta_S}.$$

The second-order derivative of the option with respect to the stock price is approximated with

$$V_{SS}\big|_{S = S_{\min} + i\Delta_S} \approx \frac{v_{i+1,j} - 2\,v_{ij} + v_{i-1,j}}{\Delta_S^2}.$$

We observe that four grid points are involved in the numerical approximation of the derivatives of
$v_{ij}$. These four grid points are marked in Fig. 4.6.
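To get a feeling for these difference quotients, they can be tried on functions whose derivatives are known. The Python sketch below is purely illustrative (the helper names are ours); it applies the backward, central, and second-order differences to simple polynomials.

```python
def backward_first(f, t, h):
    """Backward difference (f(t) - f(t-h)) / h, as used for V_t."""
    return (f(t) - f(t - h)) / h


def central_first(f, s, h):
    """Central difference (f(s+h) - f(s-h)) / (2h), as used for V_S."""
    return (f(s + h) - f(s - h)) / (2.0 * h)


def central_second(f, s, h):
    """Second-order difference (f(s+h) - 2 f(s) + f(s-h)) / h^2, as used for V_SS."""
    return (f(s + h) - 2.0 * f(s) + f(s - h)) / h ** 2


def cube(s):
    return s ** 3


def square(t):
    return t ** 2


h = 0.1
print(central_first(cube, 2.0, h))     # exact first derivative at 2 is 12
print(central_second(cube, 2.0, h))    # exact second derivative at 2 is 12
print(backward_first(square, 1.0, h))  # exact derivative at 1 is 2
```

The one-sided backward difference has an error proportional to h, whereas the central difference has an error proportional to h squared.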


FIGURE 4.6 Four points involved in the numerical approximation of the derivatives of $v_{ij}$.
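We now write the Black-Scholes equation

$$V_t + \tfrac{1}{2}\sigma^2 S^2\, V_{SS} + (r - q)\, S\, V_S - r V = 0$$

for a given node (i, j) in the grid by replacing V by $v_{ij}$ and the derivatives by the approximations
introduced before

$$\frac{v_{ij} - v_{i,j-1}}{\Delta_t}
+ \tfrac{1}{2}\sigma^2 S_i^2\, \frac{v_{i+1,j} - 2\,v_{ij} + v_{i-1,j}}{\Delta_S^2}
+ (r - q)\, S_i\, \frac{v_{i+1,j} - v_{i-1,j}}{2\,\Delta_S}
- r\, v_{ij} = 0.$$

Denoting $a_i = \frac{\Delta_t\, \sigma^2 S_i^2}{2\,\Delta_S^2}$ and $b_i = \frac{\Delta_t\, (r - q)\, S_i}{2\,\Delta_S}$ and multiplying by $\Delta_t$, we rewrite the equation as

$$v_{ij} - v_{i,j-1} + a_i\,(v_{i+1,j} - 2\,v_{ij} + v_{i-1,j}) + b_i\,(v_{i+1,j} - v_{i-1,j}) - \Delta_t\, r\, v_{ij} = 0.$$

By factoring identical nodes, we get

$$v_{ij} - v_{i,j-1} + \underbrace{(a_i - b_i)}_{d_i}\, v_{i-1,j} + \underbrace{(-2a_i - \Delta_t\, r)}_{m_i}\, v_{ij} + \underbrace{(a_i + b_i)}_{u_i}\, v_{i+1,j} = 0. \tag{4.3}$$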
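The coefficients of Eq. (4.3) depend only on the grid and on the parameters r, q, and σ, so they can be computed once. The following Python sketch (the function name `fd_coefficients` is ours, and the parameter values in the example are only illustrative) builds the grid and the three coefficient vectors for the interior rows i = 1, ..., N − 1.

```python
def fd_coefficients(r, q, T, sigma, Smin, Smax, M, N):
    """Grid S_0..S_N and coefficients d_i, m_i, u_i of Eq. (4.3), i = 1..N-1."""
    dt = T / M
    dS = (Smax - Smin) / N
    S = [Smin + i * dS for i in range(N + 1)]
    d, m, u = [], [], []
    for i in range(1, N):                          # interior rows only
        a = dt * sigma ** 2 * S[i] ** 2 / (2.0 * dS ** 2)
        b = dt * (r - q) * S[i] / (2.0 * dS)
        d.append(a - b)                            # multiplies v_{i-1,j}
        m.append(-2.0 * a - dt * r)                # multiplies v_{i,j}
        u.append(a + b)                            # multiplies v_{i+1,j}
    return S, d, m, u


S, d, m, u = fd_coefficients(0.10, 0.0, 5.0 / 12.0, 0.40, 0.0, 150.0, 100, 500)
print(len(S), len(d), d[0], m[0], u[0])
```

A quick consistency check is that each row satisfies d_i + m_i + u_i = −Δ_t r, since the terms in a_i and b_i cancel.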

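The N − 1 equations involving the N − 1 nodes $v_{ij}$, i = 1, ..., N − 1, of column j in the grid, are
then

$$\begin{aligned}
v_{1,j} - v_{1,j-1} + d_1\, v_{0,j} + m_1\, v_{1,j} + u_1\, v_{2,j} &= 0 \\
v_{2,j} - v_{2,j-1} + d_2\, v_{1,j} + m_2\, v_{2,j} + u_2\, v_{3,j} &= 0 \\
&\;\;\vdots \\
v_{i,j} - v_{i,j-1} + d_i\, v_{i-1,j} + m_i\, v_{i,j} + u_i\, v_{i+1,j} &= 0 \\
&\;\;\vdots \\
v_{N-1,j} - v_{N-1,j-1} + d_{N-1}\, v_{N-2,j} + m_{N-1}\, v_{N-1,j} + u_{N-1}\, v_{N,j} &= 0.
\end{aligned}$$

Before discussing how to solve for the unknown values on the grid nodes, we proceed with some
reformulation of the presentation of the problem. First, we resort to a matrix notation

$$v_{1:N-1,\,j} - v_{1:N-1,\,j-1} + P\, v_{0:N,\,j} = 0, \tag{4.4}$$

where matrix P is as follows:

$$P = \begin{bmatrix}
d_1 & m_1 & u_1 & 0 & \cdots & \cdots & 0 \\
0 & d_2 & m_2 & u_2 & & & \vdots \\
\vdots & & d_3 & m_3 & \ddots & & \vdots \\
\vdots & & & \ddots & \ddots & u_{N-2} & 0 \\
0 & \cdots & \cdots & 0 & d_{N-1} & m_{N-1} & u_{N-1}
\end{bmatrix}.$$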

Second, we introduce a set notation for the indices of the vectors by denoting the rows of the grid
by the set $\bar\Omega$ for which we consider the following partition:

$$\bar\Omega = \Omega \cup \partial\Omega \quad\text{and}\quad \Omega \cap \partial\Omega = \emptyset.$$

Ω is the set of interior points, that is, i = 1, ..., N − 1, and ∂Ω is the set of points lying on the border,
that is, i = 0, N of the grid. Applying this partition to the columns of P defines two matrices A and
B (the columns forming matrix B, that is, the first and the last column of P, are in gray tone)

$$P = \begin{bmatrix} B_{\cdot 1} & A & B_{\cdot 2} \end{bmatrix}.$$

Note that the columns of P correspond to the rows of the grid (first column is row zero). The
following notation

$$P\, v^{j}_{\bar\Omega} = A\, v^{j}_{\Omega} + B\, v^{j}_{\partial\Omega},$$

where

$$\begin{aligned}
v^{j}_{\Omega} &\equiv v_{ij}, & i &= 1, \ldots, N-1, \\
v^{j}_{\partial\Omega} &\equiv v_{ij}, & i &= 0, N,
\end{aligned}$$

then corresponds to the product $P\, v_{0:N,\,j}$ in Eq. (4.4). Substituting this product, we get

$$v^{j}_{\Omega} - v^{j-1}_{\Omega} + A\, v^{j}_{\Omega} + B\, v^{j}_{\partial\Omega} = 0. \tag{4.5}$$


Different approaches can be considered to compute the solution $v^{0}_{\Omega}$, which is the value of the
nodes in the leftmost column (j = 0) of the grid. The explicit method proceeds in isolating $v^{j-1}_{\Omega}$ in
Eq. (4.5),

$$v^{j-1}_{\Omega} = (I + A)\, v^{j}_{\Omega} + B\, v^{j}_{\partial\Omega},$$

and then solving the equation for j = M, M − 1, ..., 1. The left panel in Fig. 4.7 highlights the grid
nodes involved when solving for node $v_{i,j-1}$. Note that the explicit method only involves matrix
vector products.

If we approximate $V_t$, the derivative of V with respect to time, with a forward difference

$$V_t\big|_{t = j\Delta_t} \approx \frac{v_{i,j+1} - v_{i,j}}{\Delta_t},$$

Eq. (4.5) becomes

$$v^{j+1}_{\Omega} - v^{j}_{\Omega} + A\, v^{j}_{\Omega} + B\, v^{j}_{\partial\Omega} = 0, \tag{4.6}$$

and the solution $v^{0}_{\Omega}$ is obtained by successively solving the linear system

$$(I - A)\, v^{j}_{\Omega} = v^{j+1}_{\Omega} + B\, v^{j}_{\partial\Omega}$$

for j = M − 1, ..., 1, 0. This defines the implicit method. The right panel in Fig. 4.7 highlights the
grid nodes involved when solving for $v_{ij}$.
FIGURE 4.7 Explicit method (left) and implicit method (right).


The explicit and the implicit method can be combined by considering Eq. (4.6) at time step j

$$v^{j+1}_{\Omega} - v^{j}_{\Omega} + A\, v^{j}_{\Omega} + B\, v^{j}_{\partial\Omega} = 0,$$

and Eq. (4.5) at time step j + 1

$$v^{j+1}_{\Omega} - v^{j}_{\Omega} + A\, v^{j+1}_{\Omega} + B\, v^{j+1}_{\partial\Omega} = 0.$$

We now write a linear combination of the two previous equations:

$$\theta\,\bigl(v^{j+1}_{\Omega} - v^{j}_{\Omega} + A\, v^{j}_{\Omega} + B\, v^{j}_{\partial\Omega}\bigr)
+ (1 - \theta)\,\bigl(v^{j+1}_{\Omega} - v^{j}_{\Omega} + A\, v^{j+1}_{\Omega} + B\, v^{j+1}_{\partial\Omega}\bigr) = 0$$

with θ ∈ [0, 1]. Rearranging, we obtain

$$A_m\, v^{j}_{\Omega} = A_p\, v^{j+1}_{\Omega} + \theta\, B\, v^{j}_{\partial\Omega} + (1 - \theta)\, B\, v^{j+1}_{\partial\Omega}, \tag{4.7}$$

where $A_m = I - \theta A$ and $A_p = I + (1 - \theta) A$. The solution $v^{0}_{\Omega}$ is computed by solving the linear
system (4.7) for j = M − 1, ..., 1, 0. This defines the θ-method. The particular case in which θ = 1/2
defines the Crank-Nicolson method. Algorithm 13 summarizes the θ-method, and the following
table shows how the θ-method generalizes all methods.


θ     Method           A_m           A_p
0     Explicit         I             I + A
1     Implicit         I − A         I
1/2   Crank-Nicolson   I − (1/2)A    I + (1/2)A


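Algorithm 13 θ-method.


1: define terminal and boundary conditions $v^{M}_{\Omega}$, $v^{0:M-1}_{\partial\Omega}$, and θ
2: compute $A_m$, $A_p$, and B
3: for j = M − 1 : −1 : 0 do
4:     solve $A_m\, v^{j}_{\Omega} = A_p\, v^{j+1}_{\Omega} + \theta\, B\, v^{j}_{\partial\Omega} + (1 - \theta)\, B\, v^{j+1}_{\partial\Omega}$
5: end for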
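Algorithm 13 can be sketched in plain Python for a European put. The code below is an illustration under stated assumptions, not the book's implementation: the names `solve_tridiagonal` and `theta_put` are ours, the tridiagonal system is solved with the Thomas algorithm instead of a sparse LU factorization, and off-grid prices are obtained by linear rather than spline interpolation. The boundary values follow the put conditions given earlier (discounted strike at i = 0, terminal value kept at i = N).

```python
from math import exp


def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm (no pivoting); lower[0] and upper[-1] are not used."""
    n = len(diag)
    c, g = [0.0] * n, [0.0] * n
    c[0], g[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        g[i] = (rhs[i] - lower[i] * g[i - 1]) / denom
    x = [0.0] * n
    x[-1] = g[-1]
    for i in range(n - 2, -1, -1):
        x[i] = g[i] - c[i] * x[i + 1]
    return x


def theta_put(S0, X, r, q, T, sigma, Smin, Smax, M, N, theta):
    """European put by the theta-method on a uniform grid (sketch)."""
    dt, dS = T / M, (Smax - Smin) / N
    S = [Smin + i * dS for i in range(N + 1)]
    # d_i, m_i, u_i of Eq. (4.3) for interior rows i = 1..N-1 (list index i-1)
    a = [dt * sigma ** 2 * S[i] ** 2 / (2.0 * dS ** 2) for i in range(1, N)]
    b = [dt * (r - q) * S[i] / (2.0 * dS) for i in range(1, N)]
    d = [ai - bi for ai, bi in zip(a, b)]
    m = [-2.0 * ai - dt * r for ai in a]
    u = [ai + bi for ai, bi in zip(a, b)]
    # Am = I - theta*A, stored by diagonals
    lower = [-theta * di for di in d]
    diag = [1.0 - theta * mi for mi in m]
    upper = [-theta * ui for ui in u]
    V0 = [max(X - s, 0.0) for s in S]               # terminal condition
    for j in range(M - 1, -1, -1):
        V1 = V0[:]                                  # column j+1
        V0[0] = (X - S[0]) * exp(-r * (T - j * dt)) # boundary i = 0 at t_j
        rhs = []
        for k in range(N - 1):                      # rows of Ap * v^{j+1}
            i = k + 1
            lo = V1[i - 1] if i > 1 else 0.0
            hi = V1[i + 1] if i < N - 1 else 0.0
            rhs.append(V1[i] + (1.0 - theta) * (d[k] * lo + m[k] * V1[i] + u[k] * hi))
        # boundary terms theta*B*v^j + (1-theta)*B*v^{j+1}
        rhs[0] += d[0] * (theta * V0[0] + (1.0 - theta) * V1[0])
        rhs[-1] += u[-1] * (theta * V0[N] + (1.0 - theta) * V1[N])
        V0[1:N] = solve_tridiagonal(lower, diag, upper, rhs)
    pos = (S0 - Smin) / dS                          # linear interpolation at S0
    i = min(int(pos), N - 1)
    w = pos - i
    return (1.0 - w) * V0[i] + w * V0[i + 1]


args = (50.0, 50.0, 0.10, 0.0, 5.0 / 12.0, 0.40, 0.0, 150.0)
for name, theta in (("Crank-Nicolson", 0.5), ("implicit", 1.0)):
    print(name, round(theta_put(*args, 100, 500, theta), 4))
print("explicit", round(theta_put(*args, 1000, 100, 0.0), 4))
```

Because the matrix A_m does not change over time, a production code would factorize it once; the Thomas algorithm is used here only to keep the sketch free of external libraries.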


4.3.3 Implementation of the θ-method with MATLAB


As illustration of the θ-method, we suggest the evaluation of a European put option⁶ with the
MATLAB code FDM1DEuPut. Following Algorithm 13, the implementation of the θ-method essentially
consists of two steps, computing the matrices $A_m$, $A_p$, and B (Statement 2) and the solution of the
linear systems in Statement 4.


6. An application with higher space dimensions can be found in Gilli et al. (2002), where option prices are computed on
the maximum of three underlying assets.


For the first step, we suggest the MATLAB code GridU1D.m which implements a discretization
using a uniform grid. Smin and Smax define the interval for the space variable, and M and N define
the number of time and space steps, respectively. The generated matrices Am, Ap, and B are sparse
matrices. The array S stores the values of the underlying on the grid nodes.


Listing 4.5: C-FiniteDifferences/M/./Ch04/GridU1D.m


function [Am,Ap,B,S] = GridU1D(r,q,T,sigma,Smin,Smax,M,N,theta)
% GridU1D.m -- version 2004-04-14
% Uniform grid in original variables
f7 = 1; dt = T/M;
S = linspace(Smin,Smax,N+1)';
dS = (Smax - Smin) / N;
a = (dt*sigma^2*S(f7+(1:N-1)).^2) ./ (2*dS^2);
b = (dt*(r-q)*S(f7+(1:N-1))) / (2*dS);
d = a - b; m = - 2*a - dt*r; u = a + b;
P = spdiags([d m u], 0:2, N-1, N+1);
Am = speye(N-1) -    theta *P(:,f7+(1:N-1));
Ap = speye(N-1) + (1-theta)*P(:,f7+(1:N-1));
B = P(:,f7+[0 N]);


For the second step, we observe that for the successive computations for the solution of the linear
system, we only address nodes from columns j and j + 1. Thus, in the code, we use two arrays, V0
for storing column j and V1 for column j + 1. Once V0 is computed, it is copied to V1 for the next
iteration. For the row indices, we recall that, according to the definition introduced on page 69, Ω
stands for i = 1, ..., N − 1 and ∂Ω for i = 0, N. V0 is initialized in Statement 5 with the terminal
conditions for the put. The boundary conditions for i = 0 are computed in Statement 10, and for
i = N, we propagate the terminal condition for that node when we copy V0 to V1 in Statement 9.
Note that the coefficient matrix $A_m$ is factorized only once (Statement 7) and that we proceed with
forward and back-substitutions (Statement 13). The illustration below explains how the set notation
for the indices maps into the nodes of the grid.


$$A_m\, \underbrace{v^{j}_{\Omega}}_{\texttt{V0(1:N-1)}}
= \underbrace{A_p\, \underbrace{v^{j+1}_{\Omega}}_{\texttt{V1(1:N-1)}}
+ \theta\, B\, \underbrace{v^{j}_{\partial\Omega}}_{\texttt{V0([0 N])}}
+ (1 - \theta)\, B\, \underbrace{v^{j+1}_{\partial\Omega}}_{\texttt{V1([0 N])}}}_{\texttt{b}}$$


Listing 4.6: C-FiniteDifferences/M/./Ch04/FDM1DEuPut.m


 1  function [P,Svec] = FDM1DEuPut(S,X,r,q,T,sigma,Smin,Smax,M,N,theta)
 2  % FDM1DEuPut.m -- version 2006-06-12
 3  % Finite difference method for European put
 4  [Am,Ap,B,Svec] = GridU1D(r,q,T,sigma,Smin,Smax,M,N,theta);
 5  V0 = max(X - Svec,0); % Initial conditions
 6  % Solve linear system for successive time steps
 7  [L,U] = lu(Am); f7 = 1; dt = T/M;
 8  for j = M-1:-1:0
 9      V1 = V0;
10      V0(f7+0) = (X-Svec(f7+0))*exp(-r*(T-j*dt));
11      b = Ap*V1(f7+(1:N-1)) + theta *B*V0(f7+[0 N]) ...
12                            + (1-theta)*B*V1(f7+[0 N]);
13      V0(f7+(1:N-1)) = U\(L\b);
14  end
15  if nargout==2, P=V0; else P=interp1(Svec,V0,S,'spline'); end


FDM1DEuPut.m returns two output arguments, that is, Svec, which is the array of N + 1
values of the underlying on the grid going from $S_{\min}$ to $S_{\max}$, and the corresponding prices P of
the option. If the function is called with a single output argument, the price corresponding to an
arbitrary value of S, not necessarily lying in the grid, is computed by interpolation. This is done in
Statement 15.

In the following, we give numerical results for a European put with underlying S = 50, strike
X = 50, risk-free rate r = 0.10, time to maturity T = 5/12, volatility σ = 0.40, and dividend yield
q = 0. The space grid goes from $S_{\min} = 0$ to $S_{\max} = 150$; time and space steps are, respectively,
100 and 500. The parameter θ has been successively set to 1/2, 0, and 1, thus producing results for
the Crank-Nicolson, explicit, and implicit methods. The Black-Scholes price has been computed
as a benchmark.


S = 50; X = 50; r = 0.10; T = 5/12; sigma = 0.40; q = 0;
P = BSput(S,X,r,q,T,sigma)
P =
    4.0760

Smin = 0; Smax = 150; M = 100; N = 500; theta = 1/2;
P = FDM1DEuPut(S,X,r,q,T,sigma,Smin,Smax,M,N,theta)
P =
    4.0760

theta = 0;
P = FDM1DEuPut(S,X,r,q,T,sigma,Smin,Smax,M,N,theta)
P =
  -2.9365e+154

theta = 1;
P = FDM1DEuPut(S,X,r,q,T,sigma,Smin,Smax,M,N,theta)
P =
    4.0695


We observe reasonable accuracy for the Crank—Nicolson and implicit methods; however, the 
explicit method fails. With M = 1000 time steps and N = 100 space steps, the explicit method 
achieves a good result. 


theta = 0; M = 1000; N = 100; 
P = FDM1DEuPut (S,X,r,q,T,sigma, Smin, Smax,M,N, theta) 
P = 

4.0756 


4.3.4 Stability 


The precision of finite difference methods depends, among other things, on the number of time and
space steps. These issues are known as the problem of stability, which is caused by an ill-conditioned
linear system. We will not discuss this problem in detail but simply give some intuition. We have
to be aware that increasing the number of space steps increases the size of the linear system, and
that the number of time steps defines how often the linear system has to be solved. Also, parameters
like the volatility σ and the risk-free rate r influence the conditioning of the linear system.
In Fig. 4.8, we compare the results for increasing time steps and observe that the three
methods converge to identical precision. Errors are plotted as the difference between the analytical
Black-Scholes prices and the finite difference prices. Note that errors are largest with at-the-money
options.


FIGURE 4.8 M = 30 (left panel) and M = 500 (right panel). Common parameters: N = 100, σ = 0.20, r = 0.05, q = 0,
X = S = 10.


In Fig. 4.9, we compare the results for varying values of σ. The explicit method first produces
the most accurate results but becomes unstable, that is, strongly oscillates (appearing as gray grid)
if σ is increased.


FIGURE 4.9 σ = 0.20 (left panel) and σ = 0.40 (right panel). Common parameters: N = 100, M = 30, r = 0.05, q = 0,
X = S = 10.


Another illustration of the unstable behavior of the explicit method is given in Fig. 4.10 where,
first, for a given number of space steps, the explicit method is the most accurate. Increasing the
space steps improves accuracy of the implicit and Crank-Nicolson methods but destroys the results
for the explicit method (gray grid of oscillations).


FIGURE 4.10 N = 310 (left panel) and N = 700 (right panel). Common parameters: M = 30, σ = 0.20, r = 0.05, q = 0,
X = 10.


Finally, in Fig. 4.11, we see that the accuracy of the Crank-Nicolson method remains almost
unchanged for an increased value of σ. The conclusion we draw from these comparisons is that the
behavior of the Crank-Nicolson method is the most stable as its accuracy increases monotonically
with the number of space steps N.


FIGURE 4.11 N = 700, M = 30, σ = 0.50, r = 0.05, X = 10.


Example 4.1


We give an example of an exotic option that can be conveniently priced with a finite difference method.
We consider barrier options that are path dependent as they are activated or null if the underlying crosses
a certain predetermined level. We consider the particular case of a down-and-out put which becomes
void if the price of the underlying crosses a given barrier $S_b$ in a down movement. At the beginning of
the option’s life, the price $S_0$ of the underlying and the exercise price X satisfy

$$S_0 > S_b \quad\text{and}\quad X > S_b.$$

We consider the domain $S > S_b$ with the boundary conditions

$$V(S_b, t) = 0 \quad\text{and}\quad \lim_{S\to\infty} V(S,t) = 0$$

and the terminal condition

$$V(S, T) = \max(X - S,\, 0).$$

Modifying the boundary condition and setting $S_{\min} = S_b$ in the code FDM1DEuPut.m will then solve
the problem. Below, you find the modified code FDDownOutPut.m. In place of Statement 10 in
FDM1DEuPut.m the new boundary conditions are specified in Statement 7 and they propagate when
copying V0 onto V1 in Statement 11.


Listing 4.7: C-FiniteDifferences/M/./Ch04/FDDownOutPut.m


function [P,Svec] = FDDownOutPut(S,X,r,q,T,sigma,Smin,Smax,M,N,theta)
% FDDownOutPut.m -- version 2010-11-06
% Finite difference method for down and out put
[Am,Ap,B,Svec] = GridU1D(r,q,T,sigma,Smin,Smax,M,N,theta);
f7 = 1;
V0 = max(X - Svec,0); % Initial conditions
V0(f7+0) = 0;         % Boundary condition for down and out put
% Solve linear system for successive time steps
[L,U] = lu(Am);
for j = M-1:-1:0
    V1 = V0;
    b = Ap*V1(f7+(1:N-1)) + theta *B*V0(f7+[0 N]) ...
                          + (1-theta)*B*V1(f7+[0 N]);
    V0(f7+(1:N-1)) = U\(L\b);
end
if nargout==2, P=V0; else P=interp1(Svec,V0,S,'spline'); end


Below you find the numerical results for a down-and-out put with barrier $S_b = 16$, $S_0 = 25$, strike
X = 20, interest rate r = 5%, time to expiry T = 6/12, volatility σ = 40%, and dividend yield q = 0. We
also compute the price of a European put. Fig. 4.12 compares the down-and-out put with a European
put.

FIGURE 4.12 Price of down-and-out put (lower line) and European put (upper line) as a function of the price of the
underlying at the beginning of the option's life.


S0 = 25; X = 20; r = 0.05; T = 6/12; sigma = 0.40; q = 0;
Smin = 16; Smax = 40; N = 100; M = 50; theta = 1/2;
P = FDDownOutPut(S0,X,r,q,T,sigma,Smin,Smax,M,N,theta);
fprintf('\n Down-and-out put: %5.3f\n',P);

 Down-and-out put: 0.162

Smin = 0; P = FDM1DEuPut(S0,X,r,q,T,sigma,Smin,Smax,M,N,theta);
fprintf('\n European put: %5.3f\n',P);

 European put: 0.649
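Since GridU1D.m is not reproduced in this section, the following self-contained script is only a sketch of the same computation. It assumes that the grid coefficients are built as in GridU1Dsc.m (Listing 4.11) and uses dense matrices for brevity; the names Sb, Si, P, and Pdo are introduced here for illustration. Because both boundary values are zero at all times, the boundary terms of the θ-method drop out.

```matlab
% Sketch: down-and-out put on a uniform grid with the theta-method.
% Assumption: coefficients a, b, d, m, u are formed as in GridU1Dsc.m.
S0 = 25; X = 20; r = 0.05; q = 0; T = 6/12; sigma = 0.40;
Sb = 16; Smax = 40; N = 100; M = 50; theta = 1/2;

dt = T/M; dS = (Smax - Sb)/N;
Svec = linspace(Sb,Smax,N+1)';
Si = Svec(2:N);                          % interior nodes
a = dt*sigma^2*Si.^2 / (2*dS^2);
b = dt*(r-q)*Si / (2*dS);
d = a - b; m = -2*a - dt*r; u = a + b;
P  = diag(m) + diag(d(2:end),-1) + diag(u(1:end-1),1);
Am = eye(N-1) - theta*P;
Ap = eye(N-1) + (1-theta)*P;

V = max(X - Svec,0);                     % terminal condition
V(1) = 0; V(end) = 0;                    % barrier and far boundary
for j = M-1:-1:0
    V(2:N) = Am \ (Ap*V(2:N));           % boundary values are zero
end
Pdo = interp1(Svec,V,S0,'spline');
fprintf('Down-and-out put: %5.3f\n',Pdo);
```

If the assumptions hold, the printed value should be of the same order as the price reported above for the same parameters.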


4.3.5 Coordinate transformation of space variables 


In the discussion in Section 4.3.4 we saw that the precision of the Crank-Nicolson method diminishes
around the exercise price but can be improved with a finer space grid. However, there are two
arguments against fine space grids. First, a finer grid increases the computational complexity as the size of
the linear system grows, and second, in regions farther away from the strike price, a fine grid is not
needed as the precision is already very good.

Even though in our opinion the precision of the Crank-Nicolson method is sufficient for most
practical applications, one could consider using an irregularly spaced grid, dense around the exercise
price and coarser when moving away. In the following, we present some simple ideas on how to
produce variable space grids by means of coordinate transformations in the Black-Scholes equation.

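We consider a transformation of the underlying S,

$$S = S(x), \tag{4.8}$$

substitute the transformation given in Eq. (4.8) in the pricing equation, and solve it by applying
a uniform grid on the variable x. As an example, we consider the Black-Scholes equation
$V_t + \tfrac{1}{2}\sigma^2 S^2 V_{SS} + (r-q)\,S\,V_S - r\,V = 0$, where we replace S by S(x), compute the first-order derivative
of V(S(x)) with respect to x,

$$V_x = \frac{\partial V}{\partial x} = \frac{\partial V}{\partial S}\,\frac{\partial S}{\partial x} = V_S\,J(x) \quad\text{and}\quad V_S = \frac{V_x}{J(x)},$$

where $J(x) = \partial S/\partial x$, and the second-order derivative

$$\frac{\partial}{\partial x}\left(V_S\right) = V_{SS}\,J(x) \quad\text{and}\quad V_{SS} = \frac{1}{J(x)}\,\frac{\partial}{\partial x}\left(\frac{V_x}{J(x)}\right).$$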




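We can now write the Black-Scholes equation as a function of x,

$$V_t + \frac{\sigma^2 S^2(x)}{2\,J(x)}\,\frac{\partial}{\partial x}\left(\frac{V_x}{J(x)}\right) + (r-q)\,\frac{S(x)}{J(x)}\,V_x - r\,V = 0,$$

and use the following approximations:

$$J(x) \approx \frac{J\left(x+\tfrac{\Delta_x}{2}\right) + J\left(x-\tfrac{\Delta_x}{2}\right)}{2}, \qquad J\left(x+\tfrac{\Delta_x}{2}\right) \approx \frac{S(x+\Delta_x) - S(x)}{\Delta_x},$$

$$V_x \approx \frac{V(x+\Delta_x) - V(x-\Delta_x)}{2\,\Delta_x},$$

$$\frac{\partial}{\partial x}\left(\frac{V_x}{J(x)}\right) \approx \frac{V(x+\Delta_x) - V(x)}{J\left(x+\tfrac{\Delta_x}{2}\right)\Delta_x^2} - \frac{V(x) - V(x-\Delta_x)}{J\left(x-\tfrac{\Delta_x}{2}\right)\Delta_x^2}.$$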


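As done previously for the implicit method, we use a forward difference to approximate $V_t$ and
obtain

$$\frac{v_{i,j+1} - v_{ij}}{\Delta_t} + \frac{\sigma^2 S_i^2}{2\,J_i}\left(\frac{v_{i+1,j} - v_{ij}}{\Delta_x^2\,J_{i+\frac12}} - \frac{v_{ij} - v_{i-1,j}}{\Delta_x^2\,J_{i-\frac12}}\right) + (r-q)\,\frac{S_i}{J_i}\,\frac{v_{i+1,j} - v_{i-1,j}}{2\,\Delta_x} - r\,v_{ij} = 0 .$$

Substituting $J_{i+\frac12} = \frac{S_{i+1}-S_i}{\Delta_x}$, $J_{i-\frac12} = \frac{S_i-S_{i-1}}{\Delta_x}$, and $J_i = \frac{S_{i+1}-S_{i-1}}{2\,\Delta_x}$, we obtain

$$v_{i,j+1} - v_{ij} + \frac{\sigma^2 S_i^2\,\Delta_t}{S_{i+1}-S_{i-1}}\left(\frac{v_{i+1,j} - v_{ij}}{S_{i+1}-S_i} - \frac{v_{ij} - v_{i-1,j}}{S_i-S_{i-1}}\right) + (r-q)\,\frac{S_i\,\Delta_t}{S_{i+1}-S_{i-1}}\,(v_{i+1,j} - v_{i-1,j}) - \Delta_t\,r\,v_{ij} = 0 .$$

In order to compact the presentation, we denote $a_i = \frac{S_i\,\Delta_t}{S_{i+1}-S_{i-1}}$, $b_i = \frac{\sigma^2 S_i\,a_i}{S_{i+1}-S_i}$, $c_i = \frac{\sigma^2 S_i\,a_i}{S_i-S_{i-1}}$, and
$e_i = (r-q)\,a_i$, and we get the $i = 1,\ldots,N-1$ equations

$$v_{i,j+1} - v_{ij} + b_i\,(v_{i+1,j} - v_{ij}) - c_i\,(v_{ij} - v_{i-1,j}) + e_i\,(v_{i+1,j} - v_{i-1,j}) - \Delta_t\,r\,v_{ij} = 0 .$$

Collecting the identical terms, we get the following expression:

$$v_{i,j+1} - v_{ij} + \underbrace{(c_i - e_i)}_{d_i}\,v_{i-1,j} + \underbrace{(-b_i - c_i - \Delta_t\,r)}_{m_i}\,v_{ij} + \underbrace{(b_i + e_i)}_{u_i}\,v_{i+1,j} = 0,$$

which is very similar to Eq. (4.3) except for the forward difference approximation used for the time
derivative and the difference in the definition of the coefficients. The corresponding matrix notation
is

$$v_{\Omega}^{j+1} - v_{\Omega}^{j} + P\,v_{\bar\Omega}^{j} = 0,$$

where matrix P is the one defined on page 69. We apply the same partition to P as done previously
and compute the matrices Am, Ap, and B needed to implement the θ-method.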
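As a plausibility check, the coefficients $d_i$, $m_i$, and $u_i$ can be computed for an arbitrary grid in S and compared with the uniform-grid formulas used in GridU1Dsc.m (Listing 4.11). The script below is a sketch with illustrative parameter values; the variable names Sm, Si, Sp, a0, and b0 are introduced here and are not taken from the book's code.

```matlab
% Sketch: coefficients d_i, m_i, u_i for a general grid S(0:N).
r = 0.05; q = 0; sigma = 0.40; T = 0.5; M = 50; N = 10;
dt = T/M;
S  = linspace(0,40,N+1)';                    % uniform grid as a test case
Sm = S(1:N-1); Si = S(2:N); Sp = S(3:N+1);   % S_{i-1}, S_i, S_{i+1}

a = Si*dt ./ (Sp - Sm);
b = sigma^2*Si.*a ./ (Sp - Si);
c = sigma^2*Si.*a ./ (Si - Sm);
e = (r - q)*a;
d = c - e; m = -b - c - dt*r; u = b + e;

% uniform-grid coefficients for comparison
dS = S(2) - S(1);
a0 = dt*sigma^2*Si.^2 / (2*dS^2);
b0 = dt*(r-q)*Si / (2*dS);
err = max(abs([d - (a0-b0); m - (-2*a0-dt*r); u - (a0+b0)]));
fprintf('max deviation from uniform-grid formulas: %g\n',err);
```

On a uniform grid the two sets of coefficients agree up to rounding; replacing the linspace call by any increasing grid gives the coefficients for a variable grid.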
We briefly discuss two potential transformations. The first is a logarithmic transformation

$$S = X\,e^{x},$$

where X is the exercise price and $x \in [-I, I]$ is defined in a symmetric interval around zero of
length 2I. The inverse of the transformation is x = log(S/X). The right panel in Fig. 4.13 shows
the corresponding grid in the S variable. We observe that the grid becomes finer when approaching
S = 0 and not around S = X.

FIGURE 4.13 Coordinate transformations for N = 29 and X = 20. Left panel: Hyperbolic sine function with λ = 10 and
p = 0.4. Right panel: Logarithmic transformation with I = 2.

The second transformation uses a hyperbolic sine function

$$S = X\,\frac{\exp(ux - c) - \exp(-(ux - c))}{2\lambda} + X,$$

with $x \in [0, 1]$, λ a shape parameter defining $c = \log\left(\lambda + \sqrt{\lambda^2 + 1}\right)$, and $u = N c/(\lfloor pN \rfloor + 1/2)$.
N is the number of grid points, and p defines the value of x corresponding to X. The left panel in
Fig. 4.13 shows that the grid computed with this transformation is finest around S = X.
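The two grids can be generated in a few lines. The following sketch uses the parameter values quoted in the caption of Fig. 4.13 and assumes that x is discretized with N + 1 equally spaced nodes; the variable names (xl, xs, Slog, Ssinh, kl, ks) are chosen here for illustration.

```matlab
% Sketch: grids in S generated by the two transformations
X = 20; N = 29;
% logarithmic transformation, x uniform on [-I, I]
I  = 2;
xl = linspace(-I,I,N+1)';
Slog = X*exp(xl);
% hyperbolic sine transformation, x uniform on [0, 1]
lambda = 10; p = 0.4;
c  = log(lambda + sqrt(lambda^2 + 1));
u  = N*c / (floor(p*N) + 1/2);
xs = linspace(0,1,N+1)';
Ssinh = X*(exp(u*xs - c) - exp(-(u*xs - c)))/(2*lambda) + X;

[~,kl] = min(diff(Slog));     % position of the smallest spacing
[~,ks] = min(diff(Ssinh));
fprintf('log grid : finest spacing starts at S = %6.3f\n',Slog(kl));
fprintf('sinh grid: finest spacing starts at S = %6.3f\n',Ssinh(ks));
```

The logarithmic grid has its smallest spacing at the lower end of the S range, whereas the smallest spacing of the hyperbolic sine grid lies next to the exercise price.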

Finally, Fig. 4.14 compares the absolute error with respect to the Black-Scholes solution for
prices computed with the logarithmic transformation and the transformation using the hyperbolic
sine function for increasing numbers of grid nodes N.

FIGURE 4.14 Absolute errors for the hyperbolic sine transformation (circles) and the logarithmic transformation (plus
signs).

4.4 American options


An American option can be exercised at any time. Therefore, it follows that the price must satisfy
the following PDE

$$V_t + \tfrac{1}{2}\sigma^2 S^2 V_{SS} + (r-q)\,S\,V_S - r\,V = 0, \tag{4.9}$$

and in order to exclude arbitrage opportunities, the option price at any time must satisfy

$$V(S,t) \ge \max(X - S(t), 0). \tag{4.10}$$

Terminal and boundary conditions for the put are

$$V(S,T) = \max(X - S(T), 0) \quad\text{and}\quad V(0,t) = X, \quad \lim_{S\to\infty} V(S,t) = 0 .$$

We denote $S_f(t)$ as the largest value of S, at time t, for which

$$V(S_f(t), t) = X - S_f(t),$$

which then defines the early exercise boundary, that is, the price of the underlying at time t for
which it is optimal to exercise the option.
The solution of Eq. (4.9) is formalized with the following equation and inequalities⁷

$$L_{BS}(V)\,\bigl(V(S,t) - V(S,T)\bigr) = 0,$$

$$L_{BS}(V) \ge 0 \quad\text{and}\quad \bigl(V(S,t) - V(S,T)\bigr) \ge 0,$$

which constitutes a sequence of linear complementarity problems. Using the notation introduced for
the formalization of the θ-method, this translates into

$$(A_m v_\Omega^j - b^j) \times (v_\Omega^j - v_\Omega^M) = 0, \qquad A_m v_\Omega^j \ge b^j \quad\text{and}\quad v_\Omega^j \ge v_\Omega^M, \tag{4.11}$$

where $v_\Omega^M$ is the payoff, $A_m = I - \theta A$, $A_p = I + (1-\theta)A$, and $b^j = A_p v_\Omega^{j+1} + \theta B\,v_{\partial\Omega}^{j} + (1-\theta) B\,v_{\partial\Omega}^{j+1}$.
The notation × designates an element-by-element product.

The solution for the two inequalities in (4.11) can be approached by solving the linear system
$A_m x = b^j$ with an iterative method and, if necessary, truncating the elements of vector x as soon
as they are computed in the following way: $x_i = \max(x_i, v_i^M)$. This can be achieved with a modification
of the SOR method presented in Section 3.2.1. This modified method, called projected
successive overrelaxation (PSOR), for the solution of time step j is given in Algorithm 14.


Algorithm 14 Projected successive overrelaxation (PSOR).

1: give starting solution $x \in \mathbb{R}^n$
2: while not converged do
3:     for i = 1:n do
4:         $x_i = \omega\,(b_i^j - A_{i,1:n}\,x)/a_{ii} + x_i$
5:         $x_i = \max(x_i, v_i^M)$
6:     end for
7: end while
8: $v_\Omega^j = x$


Algorithm 14 is implemented with the MATLAB code PSOR.m. The input arguments are the
starting solution x1, the matrix A and vector b corresponding to period j, and the payoff. It is
also possible to override the default values for the relaxation parameter ω, the tolerance for the
convergence test, and the maximum number of iterations.

7. $L_{BS}(V) \equiv V_t + \tfrac{1}{2}\sigma^2 S^2 V_{SS} + r\,S\,V_S - r\,V$ is the Black-Scholes differential operator.


Listing 4.8: C-FiniteDifferences/M/./Ch04/PSOR.m 


function x1 = PSOR(x1,A,b,payoff,omega,tol,maxit)
% PSOR.m -- version 2004-05-03
% Projected SOR for Ax >= b and x = max(x,payoff)
if nargin == 4, tol = 1e-4; omega = 1.7; maxit = 70; end
it = 0; N = length(x1); x0 = -x1;
while not( converged(x0,x1,tol) )
    x0 = x1;
    for i = 1:N-1
        x1(i) = omega*( b(i)-A(i,:)*x1 ) / A(i,i) + x1(i);
        x1(i) = max(payoff(i), x1(i));
    end
    it=it+1; if it>maxit, error('Maxit in PSOR'), end
end
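The projection step is easiest to see on a small problem. The script below is a sketch that applies the iteration of Algorithm 14 to a made-up 6 × 6 tridiagonal system with a zero payoff; it does not call PSOR.m (which relies on the helper converged from an earlier chapter) but inlines the loop with a simple stopping rule. All numerical values are illustrative assumptions.

```matlab
% Sketch: projected SOR on a small test problem (values are made up)
n = 6;
A = 4*eye(n) - diag(ones(n-1,1),1) - diag(ones(n-1,1),-1);
b = [2; 1; 0; -1; -2; -3];
payoff = zeros(n,1);
omega = 1.2; tol = 1e-10; maxit = 500;

x = payoff;                               % starting solution
for it = 1:maxit
    xold = x;
    for i = 1:n
        x(i) = omega*(b(i) - A(i,:)*x)/A(i,i) + x(i);
        x(i) = max(payoff(i), x(i));      % projection step
    end
    if max(abs(x - xold)) < tol, break, end
end
res = A*x - b;                            % residual of the linear system
disp([x res])
```

In the displayed result, every row has either a residual close to zero (the element is unconstrained) or an element equal to the payoff with a positive residual, which is the complementarity condition in (4.11).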


Another approach for the solution of the two inequalities in (4.11) consists in computing first
the solution x of the linear system

$$A_m x = b^j$$

with a direct method, and then deriving $v_\Omega^j$ by truncating x,

$$v_\Omega^j = \max(x, v_\Omega^M),$$

with max a component-wise operation in order to satisfy the inequality (4.11). This defines the
explicit payout (EP) method. Algorithm 15 summarizes the procedure which terminates with the
solution $v_\Omega^0$.


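Algorithm 15 Explicit payout method (EP).

1: $v_\Omega^M = \max(X - S, 0)$
2: for j = M − 1 : −1 : 0 do
3:     $b^j = A_p v_\Omega^{j+1} + \theta B\,v_{\partial\Omega}^{j} + (1-\theta) B\,v_{\partial\Omega}^{j+1}$
4:     solve $A_m v_\Omega^j = b^j$
5:     $v_\Omega^j = \max(v_\Omega^j, v_\Omega^M)$
6: end for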
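Algorithm 15 can be sketched as a short self-contained script. The version below assumes a uniform grid with coefficients formed as in GridU1Dsc.m (Listing 4.11), dense matrices, and the backslash operator in place of a precomputed LU factorization; the names Si, P, rhs, and Pam are introduced here for illustration only.

```matlab
% Sketch: explicit payout method for an American put on a uniform grid
S0 = 55; X = 50; r = 0.10; q = 0; T = 5/12; sigma = 0.40;
Smin = 0; Smax = 80; N = 40; M = 50; theta = 1/2;

dt = T/M; dS = (Smax - Smin)/N;
Svec = linspace(Smin,Smax,N+1)';
Si = Svec(2:N);                           % interior nodes
a = dt*sigma^2*Si.^2 / (2*dS^2);
b = dt*(r-q)*Si / (2*dS);
d = a - b; m = -2*a - dt*r; u = a + b;
P  = diag(m) + diag(d(2:end),-1) + diag(u(1:end-1),1);
Am = eye(N-1) - theta*P;
Ap = eye(N-1) + (1-theta)*P;
B  = zeros(N-1,2); B(1,1) = d(1); B(end,2) = u(end);

V0 = max(X - Svec,0);                     % step 1: payoff
payoff = V0(2:N);
for j = M-1:-1:0
    V1 = V0;                              % boundary values stay X and 0
    rhs = Ap*V1(2:N) + theta*B*V0([1 N+1]) + (1-theta)*B*V1([1 N+1]);
    V0(2:N) = max(Am \ rhs, payoff);      % steps 4 and 5: solve, truncate
end
Pam = interp1(Svec,V0,S0,'spline');
fprintf('American put (EP sketch): %6.4f\n',Pam);
```

By construction, the computed prices never fall below the payoff on any grid node.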


The complete procedure for pricing American put options either with the projected SOR
(PSOR) or the explicit payout (EP) method is implemented with the MATLAB code
FDM1DAmPut.m.
Listing 4.9: C-FiniteDifferences/M/./Ch04/FDM1DAmPut.m 


 1| function [P,Sf] = FDM1DAmPut(S,X,r,q,T,sigma,Smin,Smax,M,N,theta,method)
 2| % FDM1DAmPut.m -- version: 2007-05-28
 3| % Finite difference method for American put
 4| f7 = 1; Sf = NaN(M,1);
 5| [Am,Ap,B,Svec] = GridU1D(r,q,T,sigma,Smin,Smax,M,N,theta);
 6| V0 = max(X - Svec,0);
 7| Payoff = V0(f7+(1:N-1));
 8| % Solve linear system for successive time steps
 9| [L,U] = lu(Am);
10| for j = M-1:-1:0
11|     V1 = V0;
12|     V0(f7+0) = X - Svec(f7+0);
13|     b = Ap*V1(f7+(1:N-1)) + theta *B*V0(f7+[0 N])...
14|         + (1-theta)*B*V1(f7+[0 N]);
15|     if strcmp(method,'PSOR')
16|         V0(f7+(1:N-1)) = PSOR(V1(f7+(1:N-1)),Am,b,Payoff);
17|         p = find(Payoff - V0(f7+(1:N-1)));
18|         Sf(f7+j) = Svec(p(1));
19|     elseif strcmp(method,'EPOUT') % Explicit payout
20|         solunc = U\(L\b);
21|         [V0(f7+(1:N-1)),Imax] = max([Payoff solunc],[],2);
22|         p = find(Imax == 2);
23|         i = p(1)-1; % Grid line of last point below payoff
24|         Sf(f7+j) = interp1(solunc(i:i+2)-Payoff(i:i+2),...
25|             Svec(f7+(i:i+2)),0,'pchip');
26|     else error('Specify method (PSOR/EPOUT)');
27|     end
28| end
29| if ~isempty(S)
30|     P = interp1(Svec,V0,S,'spline');
31| else
32|     P = [Svec(2:end-1) V0(f7+(1:N-1))];
33| end


We first discuss the results for the PSOR method with an example in which X = 50, r = 0.10,
T = 5/12, σ = 0.40, and q = 0. For the grid parameters, the values are Smin = 0, Smax = 80, N = 40,
M = 50, and θ = 1/2.

If for the input argument of FDM1DAmPut.m we specify a value for the underlying S, the
function returns the corresponding option price in the output argument P. If S is specified as an
empty array [], P will be a matrix where column one holds the value of the underlying on the
grid and column two holds the corresponding option prices. Below, we compute the option price
corresponding to the underlying with price S = 55. Note that the starting solution provided to PSOR
in Statement 16 corresponds to the solution of the preceding time step.


S = 55; X = 50; r = 0.10; T = 5/12; sigma = 0.40; q = 0;
Smin = 0; Smax = 80; N = 40; M = 50; theta = 1/2;
[P,Sf] = FDM1DAmPut(S,X,r,q,T,sigma,Smin,Smax,M,N,theta,'PSOR');

P =
    2.5769

The early exercise boundary is defined by the borderline between option prices above and below
the option's payoff. This is illustrated in the left panel in Fig. 4.15 in which we reproduce the grid
for the previous example. The nodes in which the option price is above the payoff are shown in dark.

We observe that, due to the discrete character of the grid, the early exercise boundary is a step
function. In Fig. 4.16, we map the boundary to the corresponding value of the underlying. The area
below the curve is called the stopping region, and the area above is the continuation region.

FIGURE 4.15 Finite difference grid for Smin = 0, Smax = 80, N = 40, M = 50, and θ = 1/2. Left panel: Dark nodes indicate
option price greater than payoff. Right panel: Option prices and early exercise boundary.

FIGURE 4.16 Early exercise boundary (same setting as in Fig. 4.15).

With the explicit payout method, all elements of $v_\Omega^j$ are available when the solution is truncated.
This makes it possible to proceed with an interpolation between the elements above and below
the payoff. We use one point below the payoff and two points above for interpolation. This is
implemented in Statements 22-25. The left panel in Fig. 4.17 illustrates the procedure, and the right
panel shows the early exercise boundary computed from the interpolated unconstrained solution of
the explicit payout method. We observe that the function no longer contains steps.


FIGURE 4.17 Left panel: Grid nodes used for interpolation. Right panel: Early exercise boundary resulting from interpolation
of the solutions obtained with the explicit payout method.


A finer grid produces smoother early exercise boundaries. Fig. 4.18 shows the boundaries for
both methods with a finer space grid. Prices are not significantly different, but execution times are.⁸

Matrix Am has to be factorized only once (Statement 9 in FDM1DAmPut.m), and MATLAB
solves the sparse linear systems very efficiently.⁹


S = 55; E = 50; r = 0.10; T = 5/12; sigma = 0.40; q = 0;
Smin = 0; Smax = 150; N = 300; M = 50; theta = 1/2;
tic
[P,Sf1] = FDM1DAmPut(S,E,r,q,T,sigma,Smin,Smax,M,N,theta,'PSOR');
fprintf('\n PSOR: P = %5.3f (%4.2f sec)\n',P,toc);

 PSOR: P = 2.593 (0.99 sec)

tic
[P,Sf2] = FDM1DAmPut(S,E,r,q,T,sigma,Smin,Smax,M,N,theta,'EPOUT');
fprintf('\n EPout: P = %5.3f (%4.2f sec)\n',P,toc);

 EPout: P = 2.591 (0.02 sec)


8. Execution time for the explicit payout method is in the order of 1/10 of a second.
9. In the following example, we present a modified implementation in which we use the algorithms for solving tridiagonal
systems introduced in Section 3.3.1.

FIGURE 4.18 Computation of early exercise boundaries with finer grid (Smax = 150, N = 300, and M = 50). PSOR (left
panel) and EP (right panel).


In the code PSOR.m, the default value for the relaxation parameter ω has been set to 1.7. This
value has been computed using a grid search. The coefficient matrix Am needed for the search is
computed in FDM1DAmPut.m by GridU1D.m. In order to make this matrix available we suggest
setting a breakpoint in FDM1DAmPut.m and saving the matrix with the MATLAB command save
tempAm Am. We then use the function OmegaGridSearch.m. Fig. 4.19 illustrates the search
for optimal omega.


load tempAm
A = full(Am);
D = diag(diag(A)); L = tril(A,-1); U = triu(A,1);
winf = 0.9; wsup = 1.9; npoints = 21;
[w,r] = OmegaGridSearch(A,winf,wsup,npoints);
fprintf('\n w = %5.3f r = %5.3f\n',w,r)

 w = 1.700 r = 0.718


FIGURE 4.19 Grid search for the optimal omega in the PSOR algorithm.
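OmegaGridSearch.m is not listed in this section. A grid search of this kind can be sketched as follows, under the assumption that the criterion is the spectral radius of the SOR iteration matrix $(D + \omega L)^{-1}((1-\omega)D - \omega U)$; the test matrix below is a made-up tridiagonal matrix, not the matrix Am of the example, and the names wgrid, rho, rbest, and kbest are hypothetical.

```matlab
% Sketch: grid search over omega using the spectral radius of the
% SOR iteration matrix (test matrix is made up)
n = 20;
A = 2*eye(n) - diag(ones(n-1,1),1) - diag(ones(n-1,1),-1);
D = diag(diag(A)); L = tril(A,-1); U = triu(A,1);
winf = 0.9; wsup = 1.9; npoints = 21;
wgrid = linspace(winf,wsup,npoints);
rho = zeros(npoints,1);
for k = 1:npoints
    w = wgrid(k);
    Mw = (D + w*L) \ ((1-w)*D - w*U);     % SOR iteration matrix
    rho(k) = max(abs(eig(Mw)));           % spectral radius
end
[rbest,kbest] = min(rho);
fprintf('w = %5.3f  r = %5.3f\n',wgrid(kbest),rbest);
```

A smaller spectral radius means faster convergence of the iteration; the value at ω = 1 corresponds to the Gauss-Seidel method and serves as a reference.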


Example 4.2 


The sparse linear system in the explicit payout method has been solved with MATLAB's built-in sparse
matrix-handling capabilities. However, we have presented in Section 3.3.1 algorithms and code designed
to exploit the sparsity of tridiagonal linear systems. This code can also be used for the explicit
payout method and is implemented in FDM1DAmPutsc.m. The function GridU1Dsc.m called in Statement 5
computes matrices Am and Ap that store the three diagonals of our system columnwise: the first
column holds the lower diagonal (with a leading zero), the second column the main diagonal, and the
third column the upper diagonal (with a trailing zero).

In Statement 10 of FDM1DAmPutsc.m, we compute the vectors ℓ and u of the LU factorization, and
in Statement 14, the function buildb.m computes the right-hand-side vector of the linear system to
solve. Given the storage scheme used for the tridiagonal matrix, the matrix vector product is coded in
buildb.m as

$$A_{i,[1\ 2\ 3]}\;x_{[i-1\ \ i\ \ i+1]}, \qquad i = 1,\ldots,N.$$

Back- and forward substitutions are executed in Statement 16 by solve3diag.

Execution is about four times slower compared with the built-in sparse code of MATLAB
but still remains in the order of tenths of seconds for problem sizes relevant in our applications.
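The row-wise product used with the columnwise storage can be checked against an ordinary matrix product. The sketch below uses made-up numbers; the names sub, dia, sup, Ac, and Afull are introduced here for illustration. The vector x includes the two boundary values, which do not contribute because the corresponding storage entries are zero.

```matlab
% Sketch: matrix-vector product with a tridiagonal matrix stored columnwise
n = 5;
sub = [0; 1; 2; 3; 4];        % lower diagonal, first element unused
dia = [10; 20; 30; 40; 50];   % main diagonal
sup = [5; 6; 7; 8; 0];        % upper diagonal, last element unused
Ac  = [sub dia sup];          % n-by-3 storage
x   = (1:n+2)';               % vector including both boundary values

y = zeros(n,1);
for i = 1:n
    y(i) = Ac(i,:) * x(i:i+2);   % row i times x_{i-1}, x_i, x_{i+1}
end

% same product with the full matrix acting on the interior of x
Afull = diag(dia) + diag(sub(2:end),-1) + diag(sup(1:end-1),1);
yfull = Afull * x(2:n+1);
disp(max(abs(y - yfull)))
```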


Listing 4.10: C-FiniteDifferences/M/./Ch04/FDM1DAmPutsc.m 


function [P,Sf] = FDM1DAmPutsc(S,X,r,q,T,sigma,Smin,Smax,M,N,theta)
% FDM1DAmPutsc.m -- version: 2010-10-30
% Finite difference method for American put (sparse code)
f7 = 1; Sf = NaN(M,1);
[Am,Ap,B,Svec] = GridU1Dsc(r,q,T,sigma,Smin,Smax,M,N,theta);
V0 = max(X - Svec,0);
Payoff = V0(f7+(1:N-1));
% Solve linear system for successive time steps
p = Am(2:N-1,1); d = Am(:,2); q = Am(1:N-1,3);
[l,u] = lu3diag(p,d,q);
for j = M-1:-1:0
    V1 = V0;
    V0(f7+0) = X - Svec(f7+0);
    b = buildb(Ap,B,theta,V0,V1);
    % Explicit payout
    solunc = solve3diag(l,u,q,b);
    [V0(f7+(1:N-1)),Imax] = max([Payoff solunc],[],2);
    p = find(Imax == 2);
    i = p(1)-1; % Grid line of last point below payoff
    Sf(f7+j) = interp1(solunc(i:i+2)-Payoff(i:i+2),...
        Svec(f7+(i:i+2)),0,'cubic');
end
if ~isempty(S),
    P = interp1(Svec,V0,S,'spline');
else
    P = [Svec(2:end-1) V0(f7+(1:N-1))];
end


Listing 4.11: C-FiniteDifferences/M/./Ch04/GridU1Dsc.m


function [Am,Ap,B,S] = GridU1Dsc(r,q,T,sigma,Smin,Smax,M,N,theta)
% GridU1Dsc.m -- version 2011-05-14
% Uniform grid in original variables (sparse code)
f7 = 1; dt = T/M;
S = linspace(Smin,Smax,N+1)';
dS = (Smax - Smin) / N;
a = (dt*sigma^2*S(f7+(1:N-1)).^2) ./ (2*dS^2);
b = (dt*(r-q)*S(f7+(1:N-1))) / (2*dS);
d = a - b; m = - 2*a - dt*r; u = a + b;
tta = theta;
Am = [[0; -tta*d(2:N-1)] 1-tta*m [-tta*u(1:N-2);0]];
Ap = [[0; (1-tta)*d(2:N-1)] 1+(1-tta)*m [(1-tta)*u(1:N-2);0]];
B = [d(1) u(N-1)];


Listing 4.12: C-FiniteDifferences/M/./Ch04/buildb.m 


function b = buildb(Ap,B,theta,V0,V1)
% buildb.m -- version 2010-10-30
% Builds right-hand side vector for theta method
n = size(Ap,1);
b = zeros(n,1); f7 = 1; N = n+1;
for i = 1:n
    b(i) = Ap(i,:)*V1(f7+(i-1:i+1));
end
b(1) = b(1) + theta*B(1)*V0(f7+0) + (1-theta)*B(1)*V1(f7+0);
b(n) = b(n) + theta*B(2)*V0(f7+N) + (1-theta)*B(2)*V1(f7+N);

Example 4.3 


The code FDM1DAmPut.m can be easily adapted to price a variety of options. We illustrate this with an
American strangle¹⁰ which combines a put with exercise price $X_1$ with a call with exercise price $X_2$
such that $X_1 < X_2$. This then corresponds to the following payoff (or terminal conditions, see dotted line
in the graph of Fig. 4.20)

$$V(S,T) = \begin{cases} \max(X_1 - S, 0) & \text{if } S < X_2 \\ \max(S - X_2, 0) & \text{if } S \ge X_2 \end{cases}$$

and boundary conditions

$$V(0,t) = X_1\,e^{-r(T-t)}, \qquad V(S,t) = (S - X_2)\,e^{-r(T-t)}, \quad S > X_2 .$$

In the modified code FDAmStrangle.m, the terminal conditions are coded in Statements 6-9 and the
boundary conditions in Statements 14 and 15. All remaining code, except the additional exercise price
in the input arguments, is identical with FDM1DAmPut.m.


Listing 4.13: C-FiniteDifferences/M/./Ch04/FDAmStrangle.m 


 1| function [P,Sf] = FDAmStrangle(S,X1,X2,r,q,T,sigma,Smin,Smax,M,N,theta,method)
 2| % FDAmStrangle.m -- version 2010-11-07
 3| % Finite difference method for American strangle
 4| f7 = 1; dt = T/M; Sf = NaN(M,1);
 5| [Am,Ap,B,Svec] = GridU1D(r,q,T,sigma,Smin,Smax,M,N,theta);
 6| V0 = max(X1 - Svec,0);
 7| I = find(Svec >= X2);
 8| V0(I) = Svec(I) - X2;
 9| Payoff = V0(f7+(1:N-1));
10| % Solve linear system for successive time steps
11| [L,U] = lu(Am);
12| for j = M-1:-1:0
13|     V1 = V0;
14|     V0(f7+0) = exp(-r*(T-j*dt))*(X1 - Smin);
15|     V0(f7+N) = exp(-r*(T-j*dt))*(Smax - X2);
16|     b = Ap*V1(f7+(1:N-1)) + theta *B*V0(f7+[0 N])...
17|         + (1-theta)*B*V1(f7+[0 N]);
18|     if strcmp(method,'PSOR')
19|         V0(f7+(1:N-1)) = PSOR(V1(f7+(1:N-1)),Am,b,Payoff);
20|         p = find(Payoff - V0(f7+(1:N-1)));
21|         Sf(f7+j) = Svec(p(1));
22|     elseif strcmp(method,'EPOUT') % Explicit payout
23|         solunc = U\(L\b);
24|         [V0(f7+(1:N-1)),Imax] = max([Payoff solunc],[],2);
25|         p = find(Imax == 2);
26|         i = p(1)-1; % Grid line of last point below payoff
27|         Sf(f7+j) = interp1(solunc(i:i+2)-Payoff(i:i+2),...
28|             Svec(f7+(i:i+2)),0,'pchip');
29|     else error('Specify method (PSOR/EPOUT)');
30|     end
31| end
32| if ~isempty(S),
33|     P = interp1(Svec,V0,S,'spline');
34| else
35|     P = [Svec(2:end-1) V0(f7+(1:N-1))];
36| end

10. For other approaches to price American strangles, see, for instance, Chiarella and Ziogas (2005).


Fig. 4.20 shows the price of an American strangle with $X_1 = 100$, $X_2 = 150$, r = 5%, T = 1, σ = 40%,
and q = 10%. The grid was specified with Smin = 0, Smax = 250, N = 200, M = 100, θ = 1/2, and the
explicit payout method has been selected.


S0 = 165; X1 = 100; X2 = 150; r = 0.05; T = 1; sigma = 0.40; q = 0.10;
Smin = 0; Smax = 250; N = 200; M = 100; theta = 1/2;
[P,Svec] = FDAmStrangle(S0,X1,X2,r,q,T,sigma,Smin,Smax,...
    M,N,theta,'EPOUT');
fprintf('\n American strangle: %5.3f\n',P);

 American strangle: 31.452


FIGURE 4.20 Price of American strangle (dashed line is the payoff).


Appendix 4.A A note on MATLAB’s function spdiags 


MATLAB provides a very convenient way to construct sparse diagonal matrices. This is achieved
with the function spdiags. We briefly explain its usage. Consider the command D =
spdiags(B,d,m,n); B is a matrix in which diagonals are stored column by column (all of
same length). The argument d specifies the location of the diagonals. As an example, take

$$B = \begin{bmatrix} 1 & 6 & 11\\ 2 & 7 & 12\\ 3 & 8 & 13\\ 4 & 9 & 14\\ 5 & 10 & 15 \end{bmatrix} \quad\text{and}\quad d = [-2\ \ 0\ \ 1].$$

Then, the diagonals are placed into a matrix at rows as indicated in vector d. The number of columns
is defined by the number of rows in B. This is shown in the leftmost figure in the picture below.

[Figure: left, the columns of B placed as diagonals starting at the rows indicated in d; right, the submatrix selected with m = 5 and n = 5.]


The arguments m and n in the function call specify the size of a submatrix from this matrix,
where m is the number of rows starting from row 0 and going down (0, −1, −2, ...) and n is the
number of columns starting from the leftmost column (column 1). The submatrix can be rectangular,
and n, the number of columns, does not have to match the length of the diagonal. The figure at
right shows the submatrix corresponding to m = 5 and n = 5. We note that the last two elements
in the first diagonal and the first element in the third diagonal are not included in the submatrix.
Therefore, matrix B could be specified as


$$B = \begin{bmatrix} 1 & 6 & \text{NaN}\\ 2 & 7 & 12\\ 3 & 8 & 13\\ \text{NaN} & 9 & 14\\ \text{NaN} & 10 & 15 \end{bmatrix}.$$
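The example can be reproduced directly. In the short sketch below the variable names Bnan and Dnan are introduced for illustration; it builds the 5 × 5 matrix from the original B and from the variant with NaN in the unused positions and compares the two results.

```matlab
% Sketch: the spdiags example with m = 5 and n = 5
B = [1 6 11; 2 7 12; 3 8 13; 4 9 14; 5 10 15];
d = [-2 0 1];
D = spdiags(B,d,5,5);
disp(full(D))
%      6    12     0     0     0
%      0     7    13     0     0
%      1     0     8    14     0
%      0     2     0     9    15
%      0     0     3     0    10

% entries of B that fall outside the submatrix are never read
Bnan = B; Bnan(4:5,1) = NaN; Bnan(1,3) = NaN;
Dnan = spdiags(Bnan,d,5,5);
disp(isequal(full(D),full(Dnan)))
```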
5.1 Motivation


Binomial trees serve as an intuitive approach to explain the logic behind option pricing models and 
also as a versatile computational technique. This chapter describes how to implement such models. 
All algorithms are given in pseudocode; some sample codes are given for MATLAB® and R. We 
start with the Black-Scholes option pricing model, which gives us a natural benchmark to judge 
the quality of our trees and makes it easy to demonstrate the flexibility of tree-models. 

The binomial model assumes that, over a short period of time $\Delta_t$, a financial asset's price S
either rises by a small amount or goes down by a small amount, as pictured below. The probability of
an uptick is p, while a downtick occurs with probability 1 − p. (Probabilities in this chapter are
generally risk-neutral ones.)

[Figure: over one period of length $\Delta_t$, the price S moves to $S_{up}$ with probability p or to $S_{down}$ with probability 1 − p.]

Such a model for price movements may either be “additive” (i.e. absolute movements) or “multiplicative”
(i.e. relative movements), which both refer to units of the observed price. Computationally,
a multiplicative model can be changed into an additive one by taking logarithms. Two
objections against the use of additive models are that the price could become negative, and that
the magnitude of a price change (in currency units) does not depend on the current price. Hence, a
1-unit move is as likely for a stock with price 5 as for a stock with price 500, which, empirically, is
not the case. Still, for very short periods of time, an additive model may be just as good as a multiplicative
one. In fact, the additive model may be even better at capturing market conventions (e.g.,
prices in certain markets may move by fixed currency amounts) or institutional settings (e.g., when
modeling central banks’ interest rate setting). Nevertheless, we will discuss multiplicative models;
the implementation of additive models requires only minor alterations to the procedures described
below. We will assume that S is a share price, even though the model may be applied to other types
of securities as well. We want to find the price, and later on the Greeks, of plain vanilla call and
put options. Now is time 0 and the option expires at T. The time until expiration is divided into M
periods; with M = 1, we have $\Delta_t = T$, else $\Delta_t = T/M$. The subscript t differentiates the symbol for
a short time period from the symbol for Delta, Δ, that will be used later.


Matching moments 


We start with a tree for a Black-Scholes world, thus we want to match the mean and variance of the
returns of S in our tree to a given mean and variance $\sigma^2$ in a continuous-time world. Let u and d be
the gross returns of S in case of an uptick and downtick, respectively. With M = 1 and the current
price $S_0$, we have

$$E\left(\frac{S_{\Delta_t}}{S_0}\right) = p\,u + (1-p)\,d, \tag{5.1}$$

$$\operatorname{Var}\left(\frac{S_{\Delta_t}}{S_0}\right) = \frac{1}{S_0^2}\operatorname{Var}(S_{\Delta_t}) = p\,u^2 + (1-p)\,d^2 - \bigl(p\,u + (1-p)\,d\bigr)^2, \tag{5.2}$$

where we have used

$$\operatorname{Var}(S_{\Delta_t}) = \underbrace{S_0^2\bigl(p\,u^2 + (1-p)\,d^2\bigr)}_{E(S_{\Delta_t}^2)} - \underbrace{S_0^2\bigl(p\,u + (1-p)\,d\bigr)^2}_{(E\,S_{\Delta_t})^2}$$

to obtain Eq. (5.2). E(·) and Var(·) are the expectation and variance operators, respectively. In a
risk-neutral world with continuous time, the mean gross return is $e^{r\Delta_t}$, where r is the risk-free rate.
So with Eq. (5.1),

$$p\,u + (1-p)\,d = e^{r\Delta_t}. \tag{5.3}$$

Thus we have an equation that links our first target moment, the mean return, with the tree parameters
p, u, and d. In the Black-Scholes model, we have lognormally distributed stock prices with
variance

$$\operatorname{Var}(S_{\Delta_t}) = S_0^2\,e^{2r\Delta_t}\bigl(e^{\sigma^2\Delta_t} - 1\bigr) = S_0^2\bigl(e^{2r\Delta_t + \sigma^2\Delta_t} - e^{2r\Delta_t}\bigr).$$

Dividing by $S_0^2$ and equating to Eq. (5.2), we obtain

$$p\,u^2 + (1-p)\,d^2 = e^{2r\Delta_t + \sigma^2\Delta_t}, \tag{5.4}$$

which links our second target moment, $\sigma^2$, with our tree parameters. We now have two equations,
(5.3) and (5.4), from which we try to infer the values for the three parameters p, u, and d; hence there are infinitely
many possible specifications for the tree. Different authors introduce different restrictions to obtain
a solution. One possible (and probably the best-known) assumption, made by Cox et al. (1979), is
that

$$u\,d = 1.$$

We obtain

$$p = \frac{e^{r\Delta_t} - d}{u - d}$$

from Eq. (5.3); for u and d, Cox et al. (1979) suggest the approximate solutions

$$u = e^{\sigma\sqrt{\Delta_t}}, \qquad d = e^{-\sigma\sqrt{\Delta_t}}.$$

These parameter settings ensure that Eq. (5.3) is satisfied exactly and Eq. (5.4) approximately (with
the approximation improving for decreasing $\Delta_t$ and becoming exact in the limit). See Jabbour et al.
(2001) for a discussion of these and other possible parameter settings.
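The two moment conditions can be checked numerically. The following sketch uses illustrative values for r, σ, and T, and the names Ms, err1, and err2 are introduced here; it evaluates the deviations from Eqs. (5.3) and (5.4) for the Cox et al. (1979) parameters as the number of periods grows.

```matlab
% Sketch: how well do the CRR parameters match the two target moments?
r = 0.05; sigma = 0.20; T = 1;
Ms = [10 100 1000];
err1 = zeros(size(Ms)); err2 = zeros(size(Ms));
for k = 1:numel(Ms)
    dt = T/Ms(k);
    u = exp(sigma*sqrt(dt)); d = 1/u;
    p = (exp(r*dt) - d)/(u - d);
    err1(k) = abs(p*u + (1-p)*d - exp(r*dt));                      % Eq. (5.3)
    err2(k) = abs(p*u^2 + (1-p)*d^2 - exp(2*r*dt + sigma^2*dt));   % Eq. (5.4)
    fprintf('M = %4d  err(5.3) = %8.1e  err(5.4) = %8.1e\n', ...
        Ms(k),err1(k),err2(k));
end
```

The first deviation is at the level of rounding error, while the second is nonzero and shrinks as the time step decreases.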


5.2 Growing the tree 


When we increase the number of periods, $\Delta_t = T/M$ becomes smaller. The following figure shows
the resulting prices for M = 4.

[Figure: asset prices in the tree for M = 4.]

The right side of the figure gives a convenient form to visualize and work with the resulting tree.
The node (i, j) represents the asset price after i periods and j upticks, that is

$$S_{i,j} = S\,u^{j} d^{\,i-j}.$$

A point in time on the grid is labeled i, which translates into a point in “real” time as

$$t = T\,\frac{i}{M} = i\,\Delta_t .$$

Note that there is also an equally shaped tree for the option prices in which every node corresponds
exactly to a node in the stock price tree. Furthermore, in this tree here, the parameters (u, d, p) stay
unchanged at every node. Thus, the volatility (i.e. the volatility that is inserted in Eq. (5.2) to obtain
the model parameters) is constant, as in the Black-Scholes model. (The volatility at a particular
node is also called “local volatility.”)


5.2.1 Implementing a tree 


We start with computing the current price $C_0$ of a European call; Algorithm 16 gives the procedure
for given values of the spot price S, the strike X, the volatility σ, the risk-free rate r, and time to
maturity T.


Algorithm 16 European call for S, X, r, σ, T, and M time steps.

 1: initialize $\Delta_t = T/M$, $S_0^{(0)} = S$, $v = e^{-r\Delta_t}$
 2: compute $u = e^{\sigma\sqrt{\Delta_t}}$, $d = 1/u$, $p = (e^{r\Delta_t} - d)/(u - d)$
 3: $S_0^{(M)} = S_0^{(0)} d^M$
 4: for j = 1 : M do
 5:     $S_j^{(M)} = S_{j-1}^{(M)}\,u/d$    # initialize asset prices at maturity
 6: end for
 7: for j = 0 : M do
 8:     $C_j^{(M)} = \max(S_j^{(M)} - X, 0)$    # initialize option values at maturity
 9: end for
10: for i = M − 1 : −1 : 0 do
11:     for j = 0 : i do
12:         $C_j^{(i)} = v\,\bigl(p\,C_{j+1}^{(i+1)} + (1-p)\,C_j^{(i+1)}\bigr)$    # step back through the tree
13:     end for
14: end for
15: return $C_0^{(0)}$


Note that we have written $C_j^{(i)}$ and $S_j^{(i)}$ in the algorithm. In the actual implementation, we do not
store C and S as matrices, but as vectors, which are overwritten at every time step. The subscripts
indicate the position in the vector; the superscripts only remind us that both vectors depend on the
time step.

For the price of a put, we only have to replace $\max(S_j^{(M)} - X, 0)$ in line 8 by $\max(X - S_j^{(M)}, 0)$.
An implementation in MATLAB is provided in the function EuropeanCall. 


Listing 5.1: C-BinomialTrees/M/./Ch05/EuropeanCall.m 


function C0 = EuropeanCall(S0,X,r,T,sigma,M)
% EuropeanCall.m -- version 2010-12-28
% compute constants
f7 = 1; dt = T/M; v = exp(-r * dt);
u = exp(sigma*sqrt(dt)); d = 1 / u;
p = (exp(r * dt) - d) / (u - d);

% initialize asset prices at maturity (period M)
S = zeros(M + 1,1);
S(f7+0) = S0 * d^M;
for j = 1:M
    S(f7+j) = S(f7+j - 1) * u / d;
end

% initialize option values at maturity (period M)
C = max(S - X, 0);

% step back through the tree
for i = M-1:-1:0
    for j = 0:i
        C(f7+j) = v * (p * C(f7+j + 1) + (1-p) * C(f7+j));
    end
end
C0 = C(f7+0);


A few remarks: Several of the loops in Algorithm 16 count from or to 0, which does not conform
well with how vectors and matrices are indexed in MATLAB or R. A helpful “trick” is to
use an offsetting constant (f7). Adding it to an iterator allows us to quickly code an algorithm
while reducing errors (later on, one may dispose of the constant). It is not necessary to store the
complete matrices of stock and option prices; it is enough to keep just one vector that is updated
while stepping through the tree. Note in particular that we do not need to update the stock prices;
only the option prices are computed afresh in every time step. The algorithm could be accelerated
by also precomputing quantities like (1 − p) or, since we discount in every period by $e^{-r\Delta_t}$,
by leaving out the periodic discount factor v and instead discounting C0 by $e^{-rT}$. Such changes
would, however, only marginally improve the performance while obscuring the code; thus, we omit
them.
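Since the Black-Scholes price serves as the benchmark for the tree, a quick comparison is instructive. The script below is a sketch with illustrative parameter values: it repeats the steps of EuropeanCall.m inline and evaluates the Black-Scholes formula for a call without dividends, with the normal distribution function written in terms of erfc (the helper Phi and the names d1, d2, and CBS are introduced here).

```matlab
% Sketch: binomial tree price versus Black-Scholes price of a European call
S0 = 100; X = 100; r = 0.05; T = 1; sigma = 0.20; M = 500;
f7 = 1; dt = T/M; v = exp(-r*dt);
u = exp(sigma*sqrt(dt)); d = 1/u;
p = (exp(r*dt) - d)/(u - d);
S = zeros(M+1,1); S(f7+0) = S0*d^M;
for j = 1:M, S(f7+j) = S(f7+j-1)*u/d; end
C = max(S - X,0);
for i = M-1:-1:0
    for j = 0:i
        C(f7+j) = v*(p*C(f7+j+1) + (1-p)*C(f7+j));
    end
end
C0 = C(f7+0);

% Black-Scholes benchmark; normal distribution function via erfc
Phi = @(z) 0.5*erfc(-z/sqrt(2));
d1 = (log(S0/X) + (r + sigma^2/2)*T)/(sigma*sqrt(T));
d2 = d1 - sigma*sqrt(T);
CBS = S0*Phi(d1) - X*exp(-r*T)*Phi(d2);
fprintf('tree: %7.4f   Black-Scholes: %7.4f\n',C0,CBS);
```

Varying M in this script shows how the tree price approaches the benchmark as the number of time steps increases.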


5.2.2 Vectorization 


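We do not exploit any special structure in the MATLAB code (i.e. we do not vectorize), even though
the inner loop could be avoided by realizing that for every time step i, we have

$$
e^{r\Delta t}
\underbrace{\begin{bmatrix} C_{i,i} \\ C_{i,i-1} \\ \vdots \\ C_{i,1} \\ C_{i,0} \end{bmatrix}}_{V}
= p
\underbrace{\begin{bmatrix} C_{i+1,i+1} \\ C_{i+1,i} \\ \vdots \\ C_{i+1,2} \\ C_{i+1,1} \end{bmatrix}}_{V^{+}}
+ (1 - p)
\underbrace{\begin{bmatrix} C_{i+1,i} \\ C_{i+1,i-1} \\ \vdots \\ C_{i+1,1} \\ C_{i+1,0} \end{bmatrix}}_{V^{-}}
$$

which is illustrated below for i = 2.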
[Figure: nodes of the option price tree that enter V, V+, and V− for i = 2.]
Some testing showed that implementing this approach does not substantially improve performance
(in current versions of MATLAB), even though in older versions of MATLAB it did (see
Higham, 2002). Hence, we keep the double-loop structure. The same does not hold for R, in which
nested loops are still slower than the vectorized version. The function EuropeanCall is included
in the NMOF package.
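For comparison, the vectorized update can also be written in MATLAB. The following is only a sketch with assumed input values; it prices the same call once with the double loop of Listing 5.1 and once with a single vector operation per time step.

```matlab
% Sketch: double loop versus vectorized update V = v*(p*Vplus + (1-p)*Vminus).
S0 = 100; X = 100; r = 0.05; T = 1; sigma = 0.2; M = 100;  % assumed inputs
f7 = 1; dt = T / M; v = exp(-r * dt);
u = exp(sigma * sqrt(dt)); d = 1 / u;
p = (exp(r * dt) - d) / (u - d);
S = S0 * d.^(M:-1:0)' .* u.^(0:M)';    % prices at maturity, lowest node first

% double loop, as in Listing 5.1
C = max(S - X, 0);
for i = M-1:-1:0
    for j = 0:i
        C(f7+j) = v * (p * C(f7+j + 1) + (1-p) * C(f7+j));
    end
end
C0loop = C(f7+0);

% vectorized: the vector shrinks by one element in every time step
V = max(S - X, 0);
for i = M-1:-1:0
    V = v * (p * V(f7+(1:i+1)) + (1-p) * V(f7+(0:i)));
end
C0vec = V;
fprintf('loop: %.8f   vectorized: %.8f\n', C0loop, C0vec);
```

Timing the two variants (for instance with tic and toc) on one's own installation is the simplest way to see whether vectorization pays off.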


> EuropeanCall

function(S0, X, r, tau, sigma, M = 101) {
    ## compute constants
    f7 <- 1
    dt <- tau / M
    v <- exp(-r * dt)
    u <- exp(sigma * sqrt(dt))
    d <- 1 / u
    p <- (exp(r * dt) - d) / (u - d)

    ## initialise asset prices at maturity (period M)
    S <- numeric(M + 1)
    S[f7 + 0] <- S0 * d^M
    for (j in 1:M) S[f7 + j] <- S[f7 + j - 1] * u / d

    ## initialise option values at maturity (period M)
    C <- pmax(S - X, 0)

    ## step back through the tree
    for (i in seq(M - 1, 0, by = -1))
        C <- v * (p * C[(1+f7):(i+1+f7)] + (1-p) * C[(0+f7):(i+f7)])

    C
}
<environment: namespace:NMOF>


The prices at period M could be initialized by the more efficient vectorized command
S <- S0 * u^(0:M) * d^(M:0).


5.2.3 Binomial expansion 


Algorithm 16 uses only the stock prices at T. These prices are, in turn, computed from the current
price, $S_{0,0}$. Since the number of stock paths reaching final node j is given by

$$ \binom{M}{j} = \frac{M!}{(M-j)!\,j!} $$

we can write

$$ C_{0,0} = e^{-rT} \sum_{k=0}^{M} \binom{M}{k} p^{k} (1-p)^{M-k} C_{M,k} \tag{5.5} $$

where $C_{M,k} = \max(u^{k} d^{M-k} S_{0,0} - X, 0)$ is the payoff of the call option, evaluated at the
respective final node. There is, in fact, no need to sum over all end nodes; it suffices to select
those where the option expires in the money. Such an implementation of a “tree” loses the
possibility of including early exercise features (see below). It demonstrates the flexibility of
the binomial method, though, since nothing limits us to plain vanilla payoffs max(S − X, 0) (or
max(X − S, 0) for the put). An example may be a “power option” whose payoff (for the call) is
given by

$$ \max(S^{z} - X, 0) $$

where z is a real number (usually an integer) greater than one.

A straightforward implementation of Eq. (5.5) may lead to an overflow since, as the number of
time steps grows, ever larger integers are required for the binomial coefficient (see Higham, 2002,
for a detailed discussion of ways to circumvent these problems in MATLAB; see Staunton, 2003,
for Excel experiences). The following MATLAB and R implementations, both named EuropeanCallBE,
are based on Higham (2002).


Listing 5.2: C-BinomialTrees/M/./Ch05/EuropeanCallBE.m

function C0 = EuropeanCallBE(S0,X,r,T,sigma,M)
% EuropeanCallBE.m -- version: 2011-05-15
% compute constants
dt = T / M;
u = exp(sigma*sqrt(dt)); d = 1 / u;
p = (exp(r * dt) - d) / (u - d);

% initialise asset prices at maturity (period M)
C = max(S0*d.^((M:-1:0)').*u.^((0:M)') - X,0);

% log/cumsum version
csl = cumsum(log([1; [1:M]']));
tmp = csl(M+1) - csl - csl(M+1:-1:1) + log(p)*((0:M)') + ...
      log(1-p)*((M:-1:0)');
C0 = exp(-r*T)*sum(exp(tmp).*C);


> EuropeanCallBE

function(S0, X, r, tau, sigma, M = 101) {
    ## compute constants
    dt <- tau / M
    u <- exp(sigma * sqrt(dt))
    d <- 1 / u
    p <- (exp(r * dt) - d) / (u - d)

    ## initialise asset prices at maturity (period M)
    C <- pmax(S0 * d^(M:0) * u^(0:M) - X, 0)

    ## log/cumsum version; see Higham (2002) in References
    csl <- cumsum(log(c(1, 1:M)))
    tmp <- csl[M+1] - csl - csl[(M+1):1] + log(p)*(0:M) + log(1-p)*(M:0)
    exp(-r*tau) * sum(exp(tmp) * C)
}
<environment: namespace:NMOF>
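Because only the payoff vector changes, the same expansion prices a put or a power option. The sketch below is an illustration under assumed inputs (including the exponent z = 1.02); it uses gammaln for the log of the binomial coefficients instead of the cumulative sum of logs in Listing 5.2, which is a substitution made here for brevity.

```matlab
% Sketch: binomial expansion, Eq. (5.5), with three different payoffs.
S0 = 100; X = 100; r = 0.05; T = 1; sigma = 0.2; M = 500;  % assumed inputs
z  = 1.02;                                   % exponent of the power option (assumed)
dt = T / M;  u = exp(sigma * sqrt(dt));  d = 1 / u;
p  = (exp(r * dt) - d) / (u - d);

k    = (0:M)';                               % number of upticks
ST   = S0 * u.^k .* d.^(M - k);              % stock price at final node k
logw = gammaln(M + 1) - gammaln(k + 1) - gammaln(M - k + 1) ...
       + k * log(p) + (M - k) * log(1 - p);  % log of the binomial probabilities
w    = exp(logw);

disc   = exp(-r * T);
call   = disc * sum(w .* max(ST - X, 0));
put    = disc * sum(w .* max(X - ST, 0));
powerC = disc * sum(w .* max(ST.^z - X, 0));
fprintf('call %.4f   put %.4f   power call %.4f\n', call, put, powerC);
```

A useful sanity check is put-call parity: call − put should equal S0 − X exp(−rT) up to rounding, since the tree reproduces the risk-neutral forward price exactly.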
5.3 Early exercise


Early exercise features are easily implemented in the binomial method; the original paper by Cox 
et al. (1979) advocated the method exactly because of this possibility. What is required is that when 
we compute the new option price at a particular node, we check whether the payoff from exercising 
is greater than the current value of the option. Thus we need to go through the tree; we cannot use 
the binomial expansion. It is also necessary to update the spot price at every node. Algorithm 17 
details the necessary changes in the procedure. 


Algorithm 17 Testing for early exercise: An American put.

for i = M − 1 : −1 : 0 do
    for j = 0 : i do
        $C_j^{(i)} = v\,(p\,C_{j+1}^{(i+1)} + (1 - p)\,C_j^{(i+1)})$
        $S_j^{(i)} = S_j^{(i+1)} / d$
        $C_j^{(i)} = \max(C_j^{(i)}, X - S_j^{(i)})$
    end for
end for


The MATLAB function AmericanPut implements early exercise. 


Listing 5.3: C-BinomialTrees/M/./Ch05/AmericanPut.m

function P0 = AmericanPut(S0,X,r,T,sigma,M)
% AmericanPut.m -- version 2010-12-28
f7 = 1; dt = T / M; v = exp(-r * dt);
u = exp(sigma * sqrt(dt)); d = 1 / u;
p = (exp(r * dt) - d) / (u - d);

% initialize asset prices at maturity (period M)
S = zeros(M + 1,1);
S(f7+0) = S0 * d^M;
for j = 1:M
    S(f7+j) = S(f7+j - 1) * u / d;
end

% initialize option values at maturity (period M)
P = max(X - S, 0);

% step back through the tree
for i = M-1:-1:0
    for j = 0:i
        P(f7+j) = v * (p * P(f7+j + 1) + (1-p) * P(f7+j));
        S(f7+j) = S(f7+j) / d;
        P(f7+j) = max(P(f7 + j),X - S(f7+j));
    end
end
P0 = P(f7+0);


For an R implementation, see the function vanillaOptionAmerican in the NMOF package. 
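To see what the early-exercise test adds, one can run the European and the American recursion side by side. This is a sketch with assumed inputs, following the structure of Listing 5.3; the difference of the two prices is the early-exercise premium.

```matlab
% Sketch: European versus American put on the same tree.
S0 = 100; X = 100; r = 0.05; T = 1; sigma = 0.2; M = 200;  % assumed inputs
f7 = 1; dt = T / M; v = exp(-r * dt);
u = exp(sigma * sqrt(dt)); d = 1 / u;
p = (exp(r * dt) - d) / (u - d);

S  = S0 * d.^(M:-1:0)' .* u.^(0:M)';   % prices at maturity, lowest node first
Pa = max(X - S, 0);                    % American put
Pe = Pa;                               % European put
for i = M-1:-1:0
    for j = 0:i
        Pa(f7+j) = v * (p * Pa(f7+j + 1) + (1-p) * Pa(f7+j));
        Pe(f7+j) = v * (p * Pe(f7+j + 1) + (1-p) * Pe(f7+j));
        S(f7+j)  = S(f7+j) / d;                   % spot at node (i,j)
        Pa(f7+j) = max(Pa(f7+j), X - S(f7+j));    % early-exercise test
    end
end
premium = Pa(f7+0) - Pe(f7+0);
fprintf('American %.4f   European %.4f   premium %.4f\n', ...
        Pa(f7+0), Pe(f7+0), premium);
```

With a positive interest rate, the premium is strictly positive for a put; setting r = 0 in the sketch should make it vanish.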


5.4 Dividends 


Continuous dividends 


If the dividend of the asset can be approximated by a continuous dividend yield q, the algorithms
need only be slightly changed. In a risk-neutral world, the drift of a dividend-paying asset changes
from r to r − q; hence, we replace any r with r − q in the tree parameters u, d, and p. (With u and
d as defined above, only p is affected; the discount factor v continues to use r.)
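A minimal sketch of this change for a European call follows. The inputs, including the yield q = 3%, are assumed values. The closed-form benchmark with a dividend yield is not part of the text above; it is added here only as a plausibility check, with the normal CDF written via erfc so that no toolbox is needed.

```matlab
% Sketch: European call on an asset with continuous dividend yield q.
S0 = 100; X = 100; r = 0.05; q = 0.03; T = 1; sigma = 0.2; M = 500;  % assumed inputs
dt = T / M;  v = exp(-r * dt);               % discounting still uses r
u  = exp(sigma * sqrt(dt));  d = 1 / u;      % u and d contain no r
p  = (exp((r - q) * dt) - d) / (u - d);      % drift r - q enters through p

C = max(S0 * d.^(M:-1:0)' .* u.^(0:M)' - X, 0);
for i = M-1:-1:0
    C = v * (p * C(2:i+2) + (1 - p) * C(1:i+1));
end
C0 = C;

% closed-form benchmark with dividend yield (assumed reference formula)
N   = @(x) 0.5 * erfc(-x / sqrt(2));
d1  = (log(S0 / X) + (r - q + sigma^2 / 2) * T) / (sigma * sqrt(T));
d2  = d1 - sigma * sqrt(T);
Cbs = S0 * exp(-q * T) * N(d1) - X * exp(-r * T) * N(d2);
fprintf('tree %.4f   closed form %.4f\n', C0, Cbs);
```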
Discrete dividends

Assume the asset pays a discrete dividend, that is a fixed currency amount D, at some future time
$T_D$ (with $0 < T_D < T$). Algorithm 18 describes how to implement the “escrowed dividend model”
for an American call option.


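Algorithm 18 American call for S, X, r, σ, T, $T_D$, D and M time steps.

 1: initialize $\Delta t = T/M$, $S_0^{(0)} = S$, $v = e^{-r\Delta t}$
 2: compute $u = e^{\sigma\sqrt{\Delta t}}$, $d = 1/u$, $p = (e^{r\Delta t} - d)/(u - d)$
 3: compute $S_0^{(0)} = S - D e^{-rT_D}$                          # adjust spot for dividend
 4: $S_0^{(M)} = S_0^{(0)} d^M$
 5: for j = 1 : M do
 6:     $S_j^{(M)} = S_{j-1}^{(M)} u/d$
 7: end for
 8: for j = 0 : M do
 9:     $C_j^{(M)} = \max(S_j^{(M)} - X, 0)$
10: end for
11: for i = M − 1 : −1 : 0 do
12:     compute t = iT/M                                           # compute current time
13:     compute $D^{*} = D e^{-r(T_D - t)}$                        # compute present value of dividend
14:     for j = 0 : i do
15:         $C_j^{(i)} = v\,(p\,C_{j+1}^{(i+1)} + (1 - p)\,C_j^{(i+1)})$
16:         $S_j^{(i)} = S_j^{(i+1)} / d$
17:         if $t < T_D$ then
18:             $C_j^{(i)} = \max(C_j^{(i)}, S_j^{(i)} + D^{*} - X)$   # before dividend
19:         else
20:             $C_j^{(i)} = \max(C_j^{(i)}, S_j^{(i)} - X)$           # after dividend
21:         end if
22:     end for
23: end for
24: return $C_0^{(0)}$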

The dividend’s present value D* only depends on time (the outer loop), not on the level of S. 


With MATLAB: 


Listing 5.4: C-BinomialTrees/M/./Ch05/AmericanCallDiv.m

function C0 = AmericanCallDiv(S0,X,r,T,sigma,D,TD,M)
% AmericanCallDiv.m -- version 2010-12-28
% compute constants
f7 = 1; dt = T/M; v = exp(-r * dt);
u = exp(sigma*sqrt(dt)); d = 1 /u;
p = (exp(r * dt) - d)/(u - d);

% adjust spot for dividend
S0 = S0 - D * exp(-r * TD);

% initialize asset prices at maturity (period M)
S = zeros(M + 1,1);
S(f7+0) = S0 * d^M;
for j = 1:M
    S(f7+j) = S(f7+j - 1) * u / d;
end

% initialize option values at maturity (period M)
C = max(S - X, 0);

% step back through the tree
for i = M-1:-1:0
    % compute present value of dividend (dD)
    t = T * i / M;
    dD = D * exp(-r * (TD-t));
    for j = 0:i
        C(f7+j) = v * ( p * C(f7+j + 1) + (1-p) * C(f7+j));
        S(f7+j) = S(f7+j) / d;
        if t > TD
            C(f7+j) = max(C(f7 + j), S(f7+j) - X);
        else
            C(f7+j) = max(C(f7 + j), S(f7+j) + dD - X);
        end
    end
end
C0 = C(f7+0);


5.5 The Greeks 


The Black-Scholes option price is a function f of the spot price S, the strike X, the (constant)
volatility σ, the risk-free rate r and time to maturity T. (If the underlier pays dividends, these
will also affect the value of f.) A Taylor series expansion can be used to estimate the sensitivity
of f to a given change in one of the parameters. The change in the option price is then a function
of the (mathematical) derivatives of f, evaluated at the current values of the arguments. These
derivatives are known as the “Greeks,” and for Black-Scholes the most common ones are available in
closed form. For the binomial model, the Greeks need in general be approximated by finite
differences; see Chapter 2. One advantage of this approach over the analytical expressions is that
we can use a meaningful change in the respective argument of f. For instance, the Θ may be computed
for one day hence, which may be more reasonable (and easier to communicate to a trader) than a
change of infinitesimally small size. Unfortunately, such a straightforward implementation of finite
differences requires stepping through the tree two or even three times. If time is of the essence,
some Greeks can also be approximated directly from the original tree. (These direct approximations
are again finite differences.)
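The idea of a “meaningful change” can be illustrated on the closed-form price itself. The sketch below assumes the standard Black-Scholes call formula for f (written with erfc so that no toolbox is needed) and assumed input values; it computes a Δ for a 1% move in the spot and a Θ for one calendar day.

```matlab
% Sketch: finite-difference Greeks with economically meaningful step sizes.
N  = @(x) 0.5 * erfc(-x / sqrt(2));                    % standard normal CDF
d1 = @(S,X,r,T,s) (log(S./X) + (r + s.^2/2).*T) ./ (s.*sqrt(T));
f  = @(S,X,r,T,s) S .* N(d1(S,X,r,T,s)) ...
     - X .* exp(-r.*T) .* N(d1(S,X,r,T,s) - s.*sqrt(T));

S = 100; X = 100; r = 0.05; T = 1; sigma = 0.2;        % assumed inputs

h = 0.01 * S;                                          % a 1% move in the spot
deltaFD = (f(S + h, X, r, T, sigma) - f(S - h, X, r, T, sigma)) / (2 * h);
deltaAn = N(d1(S, X, r, T, sigma));                    % analytical delta

thetaDay = f(S, X, r, T - 1/365, sigma) - f(S, X, r, T, sigma);  % one day hence
fprintf('delta FD %.5f   analytical %.5f   one-day theta %.5f\n', ...
        deltaFD, deltaAn, thetaDay);
```

Replacing f by a tree-based pricing function gives the brute-force approach described above: each Greek then costs two additional passes through the tree.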


Greeks from the tree
Delta Δ

The Δ is the change in the option price for a given small change in S. An estimate of Δ can be read
directly from the tree:

$$ \Delta_{0,0} = \frac{C_{1,1} - C_{1,0}}{S_{1,1} - S_{1,0}} \tag{5.6} $$

where a subscript to Δ indicates the node for which it is being computed (here, it is the root node,
i.e. now). The following figure shows the nodes required to compute the value (left is the stock
price tree, right the option price tree).

For an arbitrary node (i, j), we have

$$ \Delta_{i,j} = \frac{C_{i+1,j+1} - C_{i+1,j}}{S_{i+1,j+1} - S_{i+1,j}} \tag{5.7} $$


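Gamma Γ

The Γ is the rate of change in the Δ with respect to a change in the spot price S. Thus, the current
Γ may be approximated as

$$ \Gamma_{0,0} = \frac{\Delta_{1,1} - \Delta_{1,0}}{S_{1,1} - S_{1,0}}
   \quad\text{or}\quad
   \Gamma_{0,0} = \frac{\Delta_{1,1} - \Delta_{1,0}}{\tfrac{1}{2}(S_{2,2} - S_{2,0})} \tag{5.8} $$

where the second possibility uses the midpoint of the prices after two upticks and two downticks,
respectively. The Δ-values, following Eq. (5.7), are

$$ \Delta_{1,1} = \frac{C_{2,2} - C_{2,1}}{S_{2,2} - S_{2,1}}, \qquad
   \Delta_{1,0} = \frac{C_{2,1} - C_{2,0}}{S_{2,1} - S_{2,0}}. $$

Hence we obtain

$$ \Gamma_{0,0} = \frac{\dfrac{C_{2,2} - C_{2,1}}{S_{2,2} - S_{2,1}} - \dfrac{C_{2,1} - C_{2,0}}{S_{2,1} - S_{2,0}}}
                       {\tfrac{1}{2}(S_{2,2} - S_{2,0})} \tag{5.9} $$

where we have used the second approach from Eq. (5.8) to approximate a small change in S. The
following figure shows the nodes that are required to compute the Γ.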


Theta Θ

The Θ is the change in the price of the option for a small change in T. Since the remaining lifetime
of the option naturally decreases, the sign is usually switched for the analytical derivative. So
for a long position in an option, the Θ is negative. This sign change is not necessary for a finite
difference method if we let h be a small negative quantity. If the tree fulfills the condition
ud = 1 (as is the case for Cox et al., 1979), the spot price returns to its initial value at time
step 2 (after one uptick and one downtick). Thus, a simple method to approximate Θ is

$$ \Theta_{0,0} = \frac{C_{2,1} - C_{0,0}}{2\Delta t} \tag{5.10} $$


The following figure indicates the required nodes. 


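[Figure: nodes of the stock price tree (left) and the option price tree (right) used to compute Θ;
the vertical axis shows the node index j.]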
If the “centering on the spot” condition ud = 1 is not fulfilled, then Θ may either be approximated
by a standard finite difference or, as suggested by Rubinstein (1994), by

$$ \Theta_{0,0} = r C_{0,0} - r S_{0,0} \Delta_{0,0} - \tfrac{1}{2} \sigma^2 S_{0,0}^2 \Gamma_{0,0} \tag{5.11} $$

which uses the Black-Scholes differential equation. The Δ and Γ can be taken from the tree as
described above; hence, one loop through the tree suffices to obtain Θ. Some remarks: When computing
the Greeks, the spot price must be updated in the tree. If there is a discrete dividend, it must
be added back to the spot (as in the case of early exercise). A MATLAB program of this is shown
below. We chose a European call to make comparison with the analytical Greeks easier.


Listing 5.5: C-BinomialTrees/M/./Ch05/EuropeanCallGreeks.m

function [C0,deltaE,gammaE,thetaE] = EuropeanCallGreeks(...
    S0,X,r,T,sigma,M)
% EuropeanCallGreeks.m -- version 2010-12-28
% compute constants
f7 = 1; dt = T/M; v = exp(-r * dt);
u = exp(sigma*sqrt(dt)); d = 1 /u;
p = (exp(r * dt) - d)/(u - d);

% initialize asset prices at maturity (period M)
S = zeros(M + 1,1);
S(f7+0) = S0 * d^M;
for j = 1:M
    S(f7+j) = S(f7+j - 1) * u / d;
end

% initialize option values at maturity (period M)
C = max(S - X, 0);

% step back through the tree
for i = M-1:-1:0
    for j = 0:i
        C(f7+j) = v * ( p * C(f7+j + 1) + (1-p) * C(f7+j));
        S(f7+j) = S(f7+j) / d;
    end
    if i == 2
        % gamma
        gammaE = ((C(2+f7)-C(1+f7)) / (S(2+f7)-S(1+f7)) - ...
            (C(1+f7)-C(0+f7)) / (S(1+f7)-S(0+f7))) / ...
            (0.5 * (S(2+f7)-S(0+f7)));
        % theta (aux)
        thetaE = C(1+f7);
    end
    if i == 1
        % delta
        deltaE = (C(1+f7) - C(0+f7)) / (S(1+f7) - S(0+f7));
    end
    if i == 0
        % theta (final)
        thetaE = (thetaE - C(0+f7)) / (2 * dt);
    end
end
C0 = C(f7+0);
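The two Θ approximations, Eqs. (5.10) and (5.11), can be compared on the same tree. The following sketch uses assumed inputs and a vectorized recursion that stops at time step 2, so that the nodes needed for Eqs. (5.6) and (5.9) are available; the variable names are chosen here for illustration and are not those of Listing 5.5.

```matlab
% Sketch: Greeks read from the tree, with theta from Eq. (5.10) and Eq. (5.11).
S0 = 100; X = 100; r = 0.05; T = 1; sigma = 0.2; M = 500;   % assumed inputs
dt = T / M;  u = exp(sigma * sqrt(dt));  d = 1 / u;
p  = (exp(r * dt) - d) / (u - d);  v = exp(-r * dt);

C = max(S0 * d.^(M:-1:0)' .* u.^(0:M)' - X, 0);   % payoffs, lowest node first
for i = M-1:-1:2                                   % stop at time step 2
    C = v * (p * C(2:i+2) + (1 - p) * C(1:i+1));
end
C2 = C;  S2 = S0 * [d^2; 1; u^2];                  % nodes (2,0), (2,1), (2,2)
C1 = v * (p * C2(2:3) + (1 - p) * C2(1:2));        % nodes (1,0), (1,1)
S1 = S0 * [d; u];
C0 = v * (p * C1(2) + (1 - p) * C1(1));            % root node

deltaT = (C1(2) - C1(1)) / (S1(2) - S1(1));                           % Eq. (5.6)
gammaT = ((C2(3) - C2(2)) / (S2(3) - S2(2)) - ...
          (C2(2) - C2(1)) / (S2(2) - S2(1))) / (0.5 * (S2(3) - S2(1))); % Eq. (5.9)
theta510 = (C2(2) - C0) / (2 * dt);                                   % Eq. (5.10)
theta511 = r * C0 - r * S0 * deltaT - 0.5 * sigma^2 * S0^2 * gammaT;  % Eq. (5.11)
fprintf('delta %.4f  gamma %.5f  theta (5.10) %.4f  theta (5.11) %.4f\n', ...
        deltaT, gammaT, theta510, theta511);
```

For a fine tree the two Θ values should be close to each other, and the Δ should be close to the analytical value N(d1).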
Generating random numbers
6.1 Monte Carlo methods and sampling
6.1.1 How it all began 


Monte Carlo, an area in the Principality of Monaco, is probably best known for its casino and,
among quants, for lending its name to a broad range of methods and techniques that use random
numbers to crack numerical problems. One of the pioneers of the latter was Stan Ulam who, in the
1940s, wondered whether there is a practical way of finding the odds for certain outcomes in the
game of solitaire that does not require cumbersome combinatorics. His idea was to just draw a series
of samples and evaluate them. Together with John von Neumann, he then extended this approach
to solve demanding numerical problems in physics using electronic computers.¹ However, much
earlier examples of similar experiments can be found.

French mathematician Georges Louis Leclerc, Comte de Buffon, considered the following problem.
Assume that on a piece of paper there are parallel lines with a distance of t. Now, if one drops
a needle of length ℓ onto this sheet of paper,² what are the chances that it touches one of the
lines? After many experiments, he found the relationship

$$ \frac{\text{number of drops}}{\text{number of hits}} \times \frac{\ell}{t} \approx \frac{\pi}{2} $$

where π = 3.14... is the constant ratio between the circumference and the diameter of any circle.
In his honor, this method to estimate π is called Buffon’s algorithm. A variant of this method will
be introduced later in the text.³

1. See Ulam et al. (1947) and Metropolis and Ulam (1949).
2. Legend has it that he was also seen throwing bread sticks, baguettes, over his shoulder to perform experiments on a
slightly larger scale.
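Buffon's experiment is easily repeated on a computer. The sketch below is an illustration under assumptions: the needle is no longer than the line distance (ℓ ≤ t), its centre and angle are drawn uniformly, and the number of drops is arbitrary. Note that drawing the angle already uses π, so this is a demonstration of the principle rather than a practical way to compute π.

```matlab
% Sketch: Buffon's needle with n simulated drops.
n = 1e6;  t = 1;  ell = 1;             % drops, line distance, needle length (ell <= t)
x     = (t / 2) * rand(n, 1);          % distance from needle centre to nearest line
theta = (pi / 2) * rand(n, 1);         % acute angle between needle and lines
hits  = sum(x <= (ell / 2) * sin(theta));
piHat = 2 * (n / hits) * (ell / t);    % from (drops/hits) * (ell/t) ~ pi/2
fprintf('estimate of pi: %.4f\n', piHat);
```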


6.1.2 Financial applications 


In finance and economics, Monte Carlo methods are used to find approximate solutions to de- 
manding problems. One of their advantages is that the quality of approximation can be improved 
simply by increasing the number of drawn samples—obviously provided that the model is correctly 
specified and the samples have the required statistical properties. Apart from cracking challenging 
numerical problems such as derivative pricing, they are also used to generate scenarios for future 
outcomes, produce values that are substituted for real observations, or evaluate the magnitude when 
factors in complex systems change, to name just a few. 

Key ingredients are the models, the relationships, and the random numbers employed. This
chapter deals with the latter. Several alternatives exist; the main ones are:

• Genuine random numbers in the strictest sense can only be produced by physical and real-world
  processes, such as radioactive decay or the outcomes from a fair casino roulette table. If the
  efficient market hypothesis holds, one might argue that asset prices, too, contain genuine
  randomness; even when properties and underlying principles are understood and known, the actual
  outcome cannot be perfectly predicted and must, therefore, be considered random. (Genuine)
  random numbers coming from computers are rare if not impossible; devices that measure physical
  activities (e.g., background radiation) could do the trick here.

• Random numbers from software might look genuinely random at first glance, yet they do come
  from a deterministic algorithm. Therefore, they are called pseudorandom numbers. This might
  seem like a limitation, but actually, numbers from a reliable deterministic random number
  generator can have great benefits. The two most important advantages are: (i) they can be
  reproduced, which facilitates controlled and replicable experiments; and (ii) one can choose the
  type of distribution and its properties.

• Quasirandom numbers are constructed to maximize some goodness-of-fit measure. Usually, they
  are too “perfect” to look random, but they satisfy the relevant theoretical properties as well as
  possible. These methods can be efficient in areas such as numerical integration but are often
  inadequate for a series of repeated independent experiments.

• Bootstrap methods resample from a given data set. Their advantage is that no assumptions about
  parametric distributions are required, but they do rely on a representative set of available
  observations.


For our purposes, pseudorandom numbers will be the most important type. For the sake of
brevity, the prefix “pseudo” will mostly be dropped, and terms such as “random variables,”
“random variates,” and “random numbers” will refer to this group unless stated otherwise. Again,
their main property is that they are generated by some deterministic process, but they should not
be distinguishable from genuine random numbers. A lot of discussion and work has gone into creating
such processes. Since pseudorandom numbers are one of the key ingredients for simulation, it is
worthwhile to look at some of the standard algorithms.


6.2 Uniform random number generators 
6.2.1 Congruential generators 


When it comes to designing a (pseudo)random number generator (RNG), most effort has gone 
into methods for uniform random numbers. The reason for this is simple: it is relatively easy to 
transform uniform variates into other distributions. Therefore, getting the main building block of a 
uniform RNG right is crucial. 

As discussed in Section 2.1, computers operate in discrete rather than continuous space. Hence,
when b bits are used to represent a number, $2^b$ distinct values can be created. Uniform random
numbers take any value within the range [ℓ, h] with equal probability. For practical reasons,
however, uniform numbers within the range [0, 1] are considered. The transformation is
straightforward:

$$ u \sim U(0, 1) \;\Rightarrow\; \ell + (h - \ell)\,u \sim U(\ell, h). \tag{6.1} $$

Hence, if one can produce uniformly distributed integers within the range $[0, 2^b - 1]$,
interpreting these values as the mantissa immediately gives us (close enough to) continuous values
in the range [0, 1).

3. Impatient readers are referred to page | 12.

A widely used type of RNG is the sequential congruential generator, in which each new number
$u_i$ is a function of the previous ℓ numbers and parameters θ:

$$ u_i = f(u_{i-1}, \ldots, u_{i-\ell}; \theta). $$

To generate a sequence of integers, a simple case would be a linear congruential generator

$$ u_i = (a u_{i-1} + c) \bmod m, \tag{6.2} $$

where the parameters a, c, and m are integers (see Algorithm 19). Provided the initial value (or
seed) $u_0$, this method can produce values $u_i \in \{0, \ldots, m - 1\}$. For example, if m = 10,
then possible values for $u_i$ are the integers from zero to nine. With m sufficiently large, the
$u_i$s can then be converted into rational numbers $f_i \in [0, 1)$ according to $f_i = u_i/m$. If,
for example, $m = 10^p$, we would end up with values that could be considered real valued with a
precision of p digits after the decimal point.⁴


Algorithm 19 Linear congruential random number generator.

1: provide modulus m, multiplier a, 0 < a < m, increment c, 0 ≤ c < m and seed $u_0$
2: for i = 1 : n do
3:     $u_i = (a u_{i-1} + c) \bmod m$
4: end for


Listing 6.1: C-RandomNumberGeneration/M/./Ch06/LinearCongruential.m

function u = LinearCongruential(a,c,m,seed,N);
% LinearCongruential.m -- version 2011-01-06
% linear congruential random number generator
% a, c, m ... parameters
% seed ...... seed
% N ......... number of samples

% -- initialize
if nargin<5, N = 1; end;
u = nan(N+1,1);
u(1) = seed;
if u(1)<1
    u(1) = floor(u(1)*m);
end

% -- generate variates
for i = 2:(N+1)
    u(i) = mod( (a*u(i-1) + c) , m);
end

u = u/m;


4. Since computers function with a binary system, a choice of $m = 2^{p_b}$ is more common. Since several binary digits are
required to represent one decimal digit, $p_b$ has to be higher than its decimal equivalent p; $p_b = \lceil p \log_2 10 \rceil$ to be precise.
The effect, however, is the same.
TABLE 6.1 Series of pseudorandom numbers from a linear congruential RNG,
where $u_i = (a u_{i-1}) \bmod 7$ and $u_0 = 1$ and different multipliers a.

                                  a
        113  114  115  116  117  118  119  120  121  122  123
u_1       1    2    3    4    5    6    0    1    2    3    4
u_2       1    4    2    2    4    1    0    1    4    2    2
u_3       1    1    6    1    6    6    0    1    1    6    1
u_4       1    2    4    4    2    1    0    1    2    4    4
u_5       1    4    5    2    3    6    0    1    4    5    2
u_6       1    1    1    1    1    1    0    1    1    1    1
u_7       1    2    3    4    5    6    0    1    2    3    4
u_8       1    4    2    2    4    1    0    1    4    2    2
u_9       1    1    6    1    6    6    0    1    1    6    1
u_10      1    2    4    4    2    1    0    1    2    4    4

FIGURE 6.1 Random numbers from the randu function, where $u_i = (65{,}539\, u_{i-1}) \bmod 2^{31}$.


An obvious feature of this method is that the sequence will exhibit cycles: since the parameters
are fixed and only the immediate predecessor is used, a certain realization $u_{i-1}$ will always
have the same successor $u_i$. In other words, the sequence will be periodic: when a certain integer
is drawn for the second time, it will be followed by the same number (and, consequently, sequence
of numbers) as the first time. Even more importantly, with inappropriate parameter values a and
c, not all of the integers within the range will be drawn; and if an integer is missed out in the
first round, it will never occur in later rounds either. Both cases can easily be illustrated for
small ranges (i.e., with small m); see Table 6.1.
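The entries of Table 6.1 can be reproduced with a few lines of MATLAB. This sketch applies Eq. (6.2) with c = 0, m = 7, and seed 1 to all multipliers at once; the variable names are illustrative.

```matlab
% Sketch: reproduce Table 6.1 (rows u_1, ..., u_10; one column per multiplier).
m = 7;  a = 113:123;  n = 10;        % modulus, multipliers, number of draws
U = zeros(n, numel(a));
u = ones(1, numel(a));               % seed u_0 = 1 for every multiplier
for i = 1:n
    u = mod(a .* u, m);              % Eq. (6.2) with c = 0
    U(i, :) = u;
end
disp(U)
```

The columns make both problems visible: every sequence repeats after at most six draws, and several multipliers visit only a subset of the integers 1 to 6 (or get stuck at 0).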

For larger values of m, this issue is harder to spot. Therefore, it took some time to recognize
that a then widely used random number generator suffered from this problem: famously (or
rather infamously), IBM’s randu routine used the values a = 65,539, c = 0, and $m = 2^{31}$.
Plotting the sequence of the $u_i$s gives a noisy picture (as one might expect from an RNG), and so
does a two-dimensional scatterplot of $u_i$ on the y-axis against $u_{i-1}$ on the x-axis. A
three-dimensional scatterplot of subsequent values $u_{i-2}$, $u_{i-1}$, and $u_i$, however, reveals
that all points sit on one of 15 two-dimensional planes (see Fig. 6.1).
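The planes can be found without a plot. Because $65{,}539 = 2^{16} + 3$, squaring the multiplier modulo $2^{31}$ gives $u_i \equiv 6u_{i-1} - 9u_{i-2}$, so $u_i - 6u_{i-1} + 9u_{i-2}$ is always an integer multiple of m. The sketch below (seed and sample size are assumptions) computes that multiple for each triple; it can take at most 15 distinct values.

```matlab
% Sketch: the hidden linear relation behind the planes of randu.
a = 65539;  m = 2^31;  n = 1000;
u = zeros(n, 1);  u(1) = 1;                 % odd seed (assumed)
for i = 2:n
    u(i) = mod(a * u(i-1), m);              % exact in double precision
end
k = (u(3:n) - 6 * u(2:n-1) + 9 * u(1:n-2)) / m;   % index of the plane
planes = unique(k)';
fprintf('%d distinct planes, indices from %d to %d\n', ...
        numel(planes), min(planes), max(planes));
```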

This phenomenon is an inherent property of this type of generator: when sequences of generated
numbers are considered, they seem to arrange themselves in a rather distinct “lattice structure” or
start exhibiting patterns. Also, setting c = 0 is not a good choice since $u_i = 0$ is an absorbing
state, where all the subsequent numbers are also zero. Good parameter calibration is, therefore,
paramount when using this type of RNG. An example for a suitable choice of parameters, provided
by Jones et al. (2009, p. 332), would be a = 1,664,525 for the multiplier, c = 1,013,904,223 for
the increment, and $m = 2^{32}$ for the modulus. Alternatively, other types of generators can also
be used. Simple variants of the linear version include the multiple recursive generator,

$$ u_i = \Big( \sum_{j=1}^{\ell} a_j u_{i-j} \Big) \bmod m, $$

the lagged Fibonacci generator,

$$ u_i = (u_{i-j} + u_{i-k}) \bmod m, $$

and the inversive congruential generator,

$$ u_i = (a \bar{u}_{i-1} + c) \bmod m, $$

where $\bar{u}_{i-1}$ is the modular multiplicative inverse of $u_{i-1}$ (i.e.,
$\bar{u}_{i-1} u_{i-1} \equiv 1 \bmod m$). For the inversive congruential generator, setting m = 1
and using $u_0 \in (0, 1)$ and noninteger values for a and c results in values in the range (0, 1).
For the multiple recursive and lagged Fibonacci generators, the previously mentioned scaling (or
shift of the decimal point) maps the $u_i$s into the (0, 1) range.

5. A linear congruential generator with c = 0 is also called multiplicative congruential generator.

It is important to note that for all approaches, independent multivariate random numbers can be 
generated by using subsequent values. 


6.2.2 Mersenne Twister 


Software like R and, in the more recent versions, MATLAB® provide the Mersenne Twister as
standard. The name already indicates that one of its key ingredients is a Mersenne number,
$M_p = 2^p - 1$, which, when written in binary notation, is a bit string of length p with all bits
set to 1. For example, the binary representation of $M_4 = 2^4 - 1 = 15$ is $1111_2$. In other
words, $M_p$ is the highest integer that can be represented with a bit string of length p. This
method uses more than one previously generated number and performs bitwise operations. Its
properties are similar to those of linear congruential generators, yet its period length is $M_p$.
The commonly used version for 32-bit platforms, MT19937, has a period length of $2^{19937} - 1$; in
other words, it takes a sequence of more than $10^{6000}$ draws until it becomes repetitive. Also,
sequences of all lengths up to 623 are equally probable, so there should be no obvious lattice
issues. For most applications (and certainly for those discussed in this book), this seems to be a
safe enough choice.
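In MATLAB, the generator and its seed can be selected explicitly, which is what makes experiments replicable. A minimal sketch, assuming a MATLAB release that provides the rng function (the seed 42 is arbitrary):

```matlab
% Sketch: reproducible draws from the Mersenne Twister.
rng(42, 'twister');      % select generator and seed
x = rand(5, 1);
rng(42, 'twister');      % same seed, same generator
y = rand(5, 1);
same = isequal(x, y);    % true: the sequence is reproduced exactly
disp(same)
```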

There exists a large variety of approaches that test the quality of random numbers. One of the 
well-known sets of such tests are the diehard tests by George Marsaglia (see Marsaglia, 1995). 


6.3 Nonuniform distributions 
6.3.1 The inversion method 


As mentioned, uniform random numbers can be used to generate variables that follow any arbitrary
distribution. The most general approach for this is the inversion method. The idea behind this
is relatively simple but extremely powerful: Assume X has a continuous cumulative distribution
function (CDF) F(X), and $x_i$ is a random draw. Then $F(x_i)$ is uniformly distributed,
$F(x_i) \sim U(0, 1)$.

To grasp this idea intuitively, let us assume X is standard normally distributed, X ~ N(0, 1).
In this case, there is a 50-50 chance that a random draw will bring a positive or negative value
since the median for a standard normal variable is 0. Any negative realization will come with a
cumulative probability of $F(x_i) \in [0, 0.5)\ \forall x_i < 0$, whereas a positive draw (or the
draw of $x_i = 0$) comes with a cumulative probability of at least 0.5,
$F(x_i) \in [0.5, 1]\ \forall x_i \ge 0$. If we repeat this experiment n times, then about half of
the draws will be positive, the other half negative, and their respective values $F(x_i)$ will be
above 0.5 half of the time and below 0.5 the other half. Plotting a histogram of the $F(x_i)$s with
just two bins ([0, 0.5) and [0.5, 1]) should, therefore, provide two bars of equal height;
deviations are due to sampling error and should vanish when more samples are drawn. If we check
not only at the median but also at the quartiles,⁶ chances for a standard normal to be −0.67449
or less are F(−0.67449) = 0.25; accordingly, F(+0.67449) = 0.75, and we already know that
F(0) = 0.5. Hence, there is a 25% chance that $x_i$ is within (−∞, −0.67449], another 25% chance
that it falls within [−0.67449, 0], and so on. The histogram of the $F(x_i)$s for these new bins
should again have bars of equal height and contain about n/4 observations each. It is easy to see
where this example leads: there is a 10% chance that an observation falls into any of the 10
equally spaced intervals of $F(x_i)$, a 1% chance for any of the 100 equally spaced intervals
[0, 0.01], [0.01, 0.02], etc. Making the bins increasingly finer eventually approaches a continuous
uniform distribution.

FIGURE 6.2 Confidence bands and probabilities for realizations within a region (xe, xp).

FIGURE 6.3 Inversion principle.

The inversion method now reverses this principle: It draws a uniform number $u_i \sim U(0, 1)$ and
checks which $x_i$ has the corresponding $F(x_i)$. In our example, this is like picking the bin for
the histogram first and then searching the $x_i$ that belongs to that bin. Therefore, $x_i$
corresponds to the $u_i$ quantile of F. Figs. 6.2 and 6.3 illustrate this.

More technically, let $F^{-1}(\alpha) = q_\alpha$ be the inverse (or quantile) function for a CDF
F(X) and $u_i$ a uniform random number, $u_i \sim U(0, 1)$. Then, $x_i = F^{-1}(u_i)$ is a random
draw for X ~ F(X). Graphically, $F^{-1}$ switches the x- and y-axis. Analytically, one needs to
solve the definition of F(X) for X. This is possible for some distributions but not for all.
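A case where the CDF can be solved analytically is the exponential distribution, used here purely as an illustration: from $F(x) = 1 - e^{-\lambda x}$ one obtains $x = -\ln(1 - u)/\lambda$. The rate λ and the sample size in the sketch are assumed values.

```matlab
% Sketch: inversion method for the exponential distribution.
lambda = 2;  n = 1e6;                  % rate and sample size (assumed)
u = rand(n, 1);                        % uniform draws
x = -log(1 - u) / lambda;              % x = F^{-1}(u)
Fx = 1 - exp(-lambda * x);             % mapping back recovers uniform values
fprintf('sample mean %.4f (theory %.4f)\n', mean(x), 1 / lambda);
```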

More frequently, however, numerical methods have to be employed to crack this problem. Ac- 
cording to the method, the objective is to find the value x such that the CDF takes the value u: 
F(x) = u. This is equivalent to the zero-finding problem F(x) — u = 0; Chapter 11 discusses ap- 
propriate methods. The advantage of this approach is that it is the most generic one; basically, it 
works as long as the CDF F(x) can be computed for any given x. If one needs to generate many 
draws, this approach can be slow; in that case, analytical approximations for the inverse can be 
helpful. Section 6.4 provides some examples for this. 
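The zero-finding formulation can be sketched with MATLAB's fzero. The example assumes a standard normal target, with its CDF written via erfc, and a search bracket of [−10, 10] that is taken to be wide enough for all practical draws.

```matlab
% Sketch: numerical inversion by solving F(x) - u = 0.
F    = @(x) 0.5 * erfc(-x / sqrt(2));              % standard normal CDF
invF = @(u) fzero(@(x) F(x) - u, [-10, 10]);       % bracket is an assumption
q25  = invF(0.25);                                 % lower quartile
q75  = invF(0.75);                                 % upper quartile
x    = invF(rand);                                 % one random draw
fprintf('q25 = %.5f, q75 = %.5f, draw = %.5f\n', q25, q75, x);
```

Each draw requires a full root search, which illustrates why this generic route becomes slow when many variates are needed.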


6. As a quick reminder, $q_\alpha$ is called the quantile for probability α if it has a cumulative distribution of $F(q_\alpha) = \alpha$. In other
words, with a probability of α, random draws should produce values that are $q_\alpha$ or less. The quartiles $q_{25\%}$, $q_{50\%}$, and $q_{75\%}$
are special cases of quantiles.
6.3.2 Acceptance-rejection method


To draw samples x with a density $f_X$ when the inverse of the CDF, $F^{-1}$, is not available, the
acceptance-rejection method can be used. The main idea here is to take samples z from a manageable
distribution with the same support, compare the two densities at this point, and accept (or
reject) the sample accordingly. z is randomly accepted as a sample for x with probability
$f_X(z) / (c f_Z(z))$. The scaling constant c ensures that this ratio never exceeds 1; in other
words, the scaled density of z is never below that of $f_X$, $c f_Z \ge f_X$; this is why $c f_Z$ is
called the majorizing function.

To illustrate this principle, assume one wants to draw samples x from a standard normal
distribution, $f_X = \phi(x)$. The maximum density is reached at $\phi(0) \approx 0.40$. For the
sake of simplicity, we assume that we can ignore values below −5 and above +5. For the majorizing
density, one can choose a uniform distribution for the same support, z ~ U(−5, +5), and according to
Eq. (6.1), it can be a transformed U(0, 1) uniform distribution. Its density in this range is a
constant $f_Z = 1/10$; hence,

$$ c \ge \frac{\max f_X}{f_Z} = \frac{0.40}{0.10} = 4. $$

Strictly speaking, in this example $f_X$ is a truncated density and should, therefore, be adjusted
slightly. However, this would be perfectly offset by the adjusted c. With
$r = \frac{f_X(z)}{c f_Z(z)} = \frac{\phi(z)}{4 \cdot 0.1}$, the MATLAB sampling procedure reads as
follows:


while true
    z = rand * 10 - 5;
    r = normpdf(z) / ( 4 * 0.1);
    if rand < r;
        x = z;
        break;
    end
end


Obviously, the lower (higher) the density $f_X(z)$, the lower (higher) the probability of z being
actually accepted as a sample for x; therefore, the algorithm should produce samples that resemble
the distribution $f_X$.
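A vectorized variant makes the cost of this scheme visible. The sketch below (sample size assumed; the normal density is written out so that normpdf is not required) proposes all candidates at once and keeps the accepted ones. Since the area under $f_X$ is (almost) 1 and c = 4, only about one candidate in four should survive.

```matlab
% Sketch: acceptance-rejection with a U(-5,5) proposal, many candidates at once.
n   = 1e6;  c = 4;  fZ = 0.1;                  % candidates, scaling constant, proposal density
phi = @(z) exp(-z.^2 / 2) / sqrt(2 * pi);      % standard normal density
z   = 10 * rand(n, 1) - 5;                     % candidates from U(-5, 5)
acc = rand(n, 1) < phi(z) / (c * fZ);          % accept with probability f_X / (c f_Z)
x   = z(acc);                                  % accepted samples
rate = mean(acc);
fprintf('acceptance rate %.3f, mean %.3f, std %.3f\n', rate, mean(x), std(x));
```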

The uniform distribution is arguably not the best choice for a majorizing density here. The main 
downside is that it requires truncation. If one is also interested in extreme values, this is not a good 
thing. We could increase the support, but then there will be broad regions where the chances of 
acceptance are very small, and we would have to draw an increasing number of samples until a 
sample is accepted. Therefore, it might be more efficient to choose another majorizing density that 
roughly resembles the shape of the actual density of interest. 

For example, assume one wants to generate Gaussian variates $x$ by sampling and accepting/rejecting 
exponentials. The Gaussian density is 

$$\phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}.$$

We only look at positive values of the distribution (and later flip a coin to get the sign); hence, 
$f_X(x) = 2\phi(x)$ is actually twice the Gaussian density. The density of the exponential distribution 
with unit mean is 

$$f_Z(z) = e^{-z}.$$

The ratio is 

$$\frac{2\phi(z)}{e^{-z}} = \sqrt{\frac{2}{\pi}}\, e^{z - z^2/2}.$$


This expression times $1/c$ will be an acceptance probability; hence, $c$ should be chosen such that the 
scaled ratio will not take values above 1. The maximum of the ratio is found at $z = 1$, so we need to fix 
$c$ as follows: 

$$c^* = \sqrt{\frac{2}{\pi}}\, e^{1/2} \approx 1.32.$$
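
As a brief worked step (added here for completeness, using only the two densities stated above), the location of the maximum and the resulting acceptance probability can be derived as follows:

```latex
\frac{d}{dz}\left(z - \frac{z^2}{2}\right) = 1 - z = 0
\;\Longrightarrow\; z = 1,
\qquad
c^* = \sqrt{\frac{2}{\pi}}\, e^{1 - 1/2} = \sqrt{\frac{2e}{\pi}} \approx 1.3155,
\qquad
\frac{2\phi(z)}{c^*\, e^{-z}} = e^{\,z - z^2/2 - 1/2} = e^{-(z-1)^2/2}.
```

The last expression shows that, with $c = c^*$, a candidate $z$ is accepted with probability $e^{-(z-1)^2/2}$, which equals 1 only at $z = 1$.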


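Any other $c > c^*$ would work as well, but it would be less efficient, since we would reject more 
often. The following figures illustrate this. 

[Figure: majorizing function $c f_Z$ and target density $2\phi$ for $c = c^*$ (left panel) and $c = 2$ (right panel).] 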


The y-scale of these densities does not matter (only their ratio); hence, we omit the y-axis. In 
the left panel, $c^*$ is used. Suppose we draw a $z$ of 0.3; the probability of acceptance is about 80%. 
In the right panel, we use a $c$ of 2; the probability of acceptance drops to about 50%. Thus, we are 
much more likely to go through the loop without returning a random variate $x$. This scheme can be 
further simplified (Devroye, 1986, pp. 44-45); then, we obtain Algorithm 20. 


Algorithm 20 Simplified acceptance-rejection for standard normal variates. 
1: repeat 
2:     generate exponential sample Z 
3:     generate (0, 1) uniform sample U and set V = 2U − 1 
4: until (Z − 1)^2 ≤ −2 log(|V|) 
5: set Y = sign(V) Z 


Listing 6.2: C-RandomNumberGeneration/M/./Ch06/NormalAcceptReject.m 


function x = NormalAcceptReject(n) 
% NormalAcceptReject.m -- version 2011-01-06 
% generate n standard normal variates x 
x = NaN(n,1); 
for i = 1:n 
    while isnan( x(i) ) 
        z = -log(rand);                        % exponential candidate with unit mean 
        r = 2*normpdf(z) / ( 1.32*exp(-z) );   % f_X(z)/(c f_Z(z)) with f_X = 2*phi 
        if rand < r 
            x(i) = z * sign(rand - 0.5);       % attach a random sign 
        end 
    end 
end 
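
Algorithm 20 itself is not implemented in the listing above. A minimal sketch of how it might look in MATLAB is given below; the vectorized retry loop, the variable names (`todo`, `idx`, `ok`), and the fixed seed are assumptions for illustration and are not taken from the source.

```matlab
% Sketch of Algorithm 20 (simplified acceptance-rejection for N(0,1)).
% The retry bookkeeping (todo, idx, ok) is an illustrative choice.
rng(1);                                   % fixed seed (assumption)
n = 1e5;
x = NaN(n,1);
todo = true(n,1);                         % entries still waiting for a valid draw
while any(todo)
    idx = find(todo);
    m = numel(idx);
    z = -log(rand(m,1));                  % exponential candidates with unit mean
    v = 2*rand(m,1) - 1;                  % V = 2U - 1, uniform on (-1, 1)
    ok = (z - 1).^2 <= -2*log(abs(v));    % acceptance test of Algorithm 20
    x(idx(ok)) = sign(v(ok)) .* z(ok);    % the sign of V supplies the random sign
    todo(idx(ok)) = false;
end
fprintf('mean %.3f, std %.3f\n', mean(x), std(x));
```

The test in line 4 of the algorithm is the acceptance probability $e^{-(z-1)^2/2}$ compared against $|V|$ after taking logarithms; since $|V|$ and $\mathrm{sign}(V)$ are independent, one uniform draw serves both the acceptance decision and the coin flip for the sign.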


A special variant of the acceptance-rejection method is the ziggurat algorithm by Marsaglia and 
Tsang (2000), where the density is split into layers; a draw consists of first picking a layer and then 
an observation that falls within this segment. 

FIGURE 6.4 Numbers within the unit square before (top panel) and after (bottom panel) Box-Muller transformation. 


6.4 Specialized methods for selected distributions 
6.4.1 Normal distribution 


The Box-Muller method 


A widely popular algorithm to generate variates that are $N(0, 1)$ distributed has been suggested by 
Box and Muller (1958). Its beauty lies in several aspects: the approach is remarkably simple and 
fast, and it produces two uncorrelated draws.⁷ The idea is to use two (independent) uniform random 
numbers from the unit square, $u_1, u_2 \sim U(0, 1)$, and to convert them into $z_1, z_2 \sim N(0, 1)$: 

$$z_1 = \underbrace{\sqrt{-2\log(u_1)}}_{r}\; \underbrace{\cos(2\pi u_2)}_{c_x}$$

$$z_2 = \underbrace{\sqrt{-2\log(u_1)}}_{r}\; \underbrace{\sin(2\pi u_2)}_{c_y}.$$

Let us start with the second part of the equations. For any value $u_2 \in [0, 1]$, the transformations 
$c_x = \cos(2\pi u_2)$ and $c_y = \sin(2\pi u_2)$ produce the coordinates for a point on the unit circle. The 
first part, $r = \sqrt{-2\log(u_1)}$, produces a value in the range of $(0, \infty)$: if $u_1$ goes to zero, $r$ goes to 
$+\infty$, and if $u_1$ goes to 1, $r$ converges to zero. Multiplied with $c_x$ and $c_y$, $r$ scales the coordinates. 
Consequently, the point $(z_1, z_2)$ is now sitting on a circle of random radius $r$. Fig. 6.4 illustrates this. 


Listing 6.3: C-RandomNumberGeneration/M/./Ch06/GaussianBoxMullerSim.m 


function [z1,z2] = GaussianBoxMullerSim(N_samples) 
% GaussianBoxMullerSim.m -- version 2011-01-06 
if nargin < 1, N_samples = 1; end 
u  = rand(N_samples,2); 
z1 = sqrt(-2*log(u(:,1))) .* cos(2 * pi * u(:,2)); 
z2 = sqrt(-2*log(u(:,1))) .* sin(2 * pi * u(:,2)); 


Alternatively, one could also argue that first the radius of the circle, $r$, is picked randomly (via 
$u_1$), and then a random point $(z_1, z_2)$ on this circle is chosen (via $u_2$). Note that, by construction, 
any point on this circle has the same chance of being picked. This ensures that $z_1$ and $z_2$ will be linearly 
uncorrelated.⁸ In other words, $z_i \sim N(0, 1)$ and $\operatorname{Cov}(z_1, z_2) = 0$. As a side note, the Box-Muller 
transformation always requires two random uniform numbers, even if only one normal variate is to 
be produced. Fixing either $u_1$ or $u_2$ will cause neither $z_1$ nor $z_2$ to be normally distributed. 


7. How to generate variates that do exhibit correlation and other forms of dependencies will be presented in Chapter 7. 
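
The claimed properties can be checked empirically. The following sketch repeats the Box-Muller transformation inline and reports sample moments; the sample size, the seed, and the variable names are assumptions chosen for illustration.

```matlab
% Sketch: empirical check of the Box-Muller transformation.
rng(7);                              % fixed seed (assumption)
N  = 1e5;
u  = rand(N,2);
r  = sqrt(-2*log(u(:,1)));           % random radius
z1 = r .* cos(2*pi*u(:,2));
z2 = r .* sin(2*pi*u(:,2));
C  = corrcoef(z1, z2);               % 2-by-2 correlation matrix
fprintf('mean(z1) %.3f, std(z1) %.3f, std(z2) %.3f, corr %.3f\n', ...
        mean(z1), std(z1), std(z2), C(1,2));
fprintf('mean of squared radius %.3f (theory: 2)\n', mean(z1.^2 + z2.^2));
```

The squared radius $r^2 = -2\log(u_1)$ is exponentially distributed with mean 2, which is the $\chi^2$ distribution with two degrees of freedom, as expected for the sum of two squared standard normals.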


Marsaglia’s polar method 


Based on the Box-Muller transformation, George Marsaglia’s polar method avoids trigonometric 
functions that used to be computationally time-consuming (see Marsaglia, 1991). The basic idea 
is to pick a point $(a_1, a_2)$ from within the unit circle and transform its coordinates. The unit circle 
stretches both horizontally and vertically from $-1$ to $+1$; this means $a_1, a_2 \in [-1, 1]$. To achieve 
this, one can draw two uniform numbers and scale them according to Eq. (6.1): 

$$a_i = -1 + (1 - (-1))\,u_i = -1 + 2u_i \quad\text{for } i = 1, 2 \text{ and } u_i \sim U(0, 1).$$


Obviously, the point $(a_1, a_2)$ could also be very close to one of the corners of the square enclosing 
the unit circle; (0.9, 0.9) would be an example of such a case. To check whether the point actually 
is at or within the unit circle, we can use the Pythagorean theorem: the distance to the origin of 
point $(a_1, a_2)$ is 

$$r = \sqrt{a_1^2 + a_2^2}.$$

The unit circle, with its center right at the origin, is defined by $r$ being exactly one. So to meet 
the “at-or-within-the-unit-circle” rule, points with $r > 1$ cannot be used and must be replaced with 
another random draw of the $a_i$. As a side note, we have just touched on a Monte Carlo method of 
estimating the value of $\pi$. 


Example 6.1 


The area of a circle of radius $r$ is $A_c = r^2\pi$; the area of the square enclosing it is $A_s = (2r)^2$. Therefore, 
the area of the unit circle with $r = 1$ is exactly $A_c = \pi$; the square enclosing it has a side length of 
$2r = 2$ and an area of $A_s = 4$. The ratio of the two areas is, therefore, $A_c/A_s = \pi/4$. Now, this implies 
that a randomly picked point $(a_1, a_2)$ with $a_i \in [-1, 1]$ should have a $\pi/4$ chance to fall at or within 
the unit circle, that is, to sit on a circle with radius $r \le 1$. For $n$ draws, one would expect the number 
of points within the circle, $n_c$, to be $E(n_c) = n \cdot \pi/4$. Solving this equation, we get an approximation for $\pi$: 
$\pi = 4E(n_c)/n$. PibySimulation.m uses this principle. Note how larger sample sizes $n$ give estimates 
closer to the actual value of $\pi = 3.14159\ldots$ 


Listing 6.4: C-RandomNumberGeneration/M/./Ch06/PibySimulation.m 


function piEst = PibySimulation(N_samples) 
% PibySimulation.m -- version 2011-01-06 

a = rand(N_samples,2)*2 - 1; 
r = sqrt(a(:,1).^2 + a(:,2).^2); 
within_unit_circle = (r<=1); 
piEst = 4 * sum(within_unit_circle) / N_samples; 


Once a “valid” point with $r \le 1$ is found, it can be transformed into a standard normal according to 

$$z_1 = \sqrt{-2\log(r^2)}\,\frac{a_1}{r} \quad\text{and}\quad z_2 = \sqrt{-2\log(r^2)}\,\frac{a_2}{r}.$$

Again, $z_1$ and $z_2$ have no linear correlation, so 

$$z_i \sim N(0, 1) \quad\text{and}\quad \operatorname{Cov}(z_1, z_2) = 0.$$


8. Recipes for generating related variates will be presented in Chapter 7. 

Listing 6.5: C-RandomNumberGeneration/M/./Ch06/GaussianPolarSim.m 


function [z1,z2] = GaussianPolarSim(N_samples) 
% GaussianPolarSim.m -- version 2011-01-06 

if nargin < 1, N_samples = 1; end 
a = rand(N_samples,2) * 2 - 1; 
s = a(:,1).^2 + a(:,2).^2; 

% --check for "at-or-within-unit-circle" criterion 
outside = s > 1; 
while sum(outside) > 0 
    a(outside,:) = rand(sum(outside),2) * 2 - 1; 
    s(outside,:) = a(outside,1).^2 + a(outside,2).^2; 
    outside = s > 1; 
end 

% --perform transformation 
f = sqrt(-2*log(s)./s); 
z1 = f .* a(:,1); 
z2 = f .* a(:,2); 


By definition, any Gaussian variable, $X \sim N(\mu, \sigma^2)$, can be standardized by first centering it 
and then dividing it by its standard deviation, 

$$X \sim N(\mu, \sigma^2) \;\Longrightarrow\; Z \sim N(0, 1) \quad\text{where}\quad Z = \frac{X - \mu}{\sigma}.$$

Considering the opposite route, a standard normal variable can be transformed into any arbitrary 
normal distribution by scaling it with the required standard deviation $\sigma$ and shifting it by adding 
the location parameter (for this distribution equal to the expected value) $\mu$: 

$$Z \sim N(0, 1) \;\Longrightarrow\; X \sim N(\mu, \sigma^2), \quad\text{where}\quad X = Z\sigma + \mu.$$


Therefore, it is sufficient to have a generator for standard normal variables; all others can be derived 
by transformation. 


6.4.2 Higher order moments and the Cornish-Fisher expansion 


By definition, the normal distribution is symmetric and has a kurtosis of 3. In finance, we often 
find that empirical data have distributions that are (somewhat) bell shaped, but are heavy tailed and 
skewed. One way of dealing with this is the choice of alternative parametric distributions that share 
these properties; another one is to transform the data. A simple way to do so is the Cornish-Fisher 
expansion. The idea is to shift the quantile of the normal distribution, $u_\alpha = N^{-1}(\alpha)$, according to 

$$\Omega_\alpha = u_\alpha + \frac{S}{6}\left(u_\alpha^2 - 1\right) + \frac{K - 3}{24}\left(u_\alpha^3 - 3u_\alpha\right) - \frac{S^2}{36}\left(2u_\alpha^3 - 5u_\alpha\right),$$

where $S$ and $K$ are the skewness and kurtosis, respectively. Given mean $\mu$ and standard deviation 
$\sigma$, the critical $\alpha$ quantile can then be computed by 

$$q_\alpha = \mu + \Omega_\alpha \sigma.$$


A simple MATLAB function is provided below. 

Listing 6.6: C-RandomNumberGeneration/M/./Ch06/CornishFisher.m 


function q = CornishFisher(r,alpha) 
% CornishFisher.m -- version 2011-01-06 
S = skewness(r); 
K = kurtosis(r); 
u = norminv(alpha); 
Omega = u + S/6*(u^2-1) + (K-3)/24 *(u^3 - 3*u) ... 
      - S^2/36 * (2*u^3-5*u); 
q = mean(r) + Omega * std(r); 


FIGURE 6.5 Required skewness (left panel) and kurtosis (right panel), and sample moments with 500 observations per 
sample. 


Similar to using kernel density estimates (see Section 6.7.2) rather than the raw empirical 
distribution, the Cornish-Fisher expansion can smooth the ragged nature of observations (in particular, 
in the tails of the distribution) and allow $\alpha$ quantiles that would otherwise not be permitted by the 
granularity of the data (e.g., estimate the 0.1% VaR based on just 250 observations). 

In turn, this approach can also be used for simulating data that have a distribution close to a 
normal, yet with some skewness and/or excess kurtosis. 


Listing 6.7: C-RandomNumberGeneration/M/./Ch06/CornishFisherSimulation.m 


function X = CornishFisherSimulation(mu,sigma,skew,kurt,N) 
% CornishFisherSimulation.m version 2011-01-06 
alpha = rand(N,1); 
u = norminv(alpha); 
Omega = u + skew/6*(u.^2-1) + (kurt-3)/24 *(u.^3 - 3*u) ... 
      - skew^2/36 * (2*u.^3-5*u); 
X = mu + sigma * Omega; 


However, note that this only works for reasonably small deviations from the normal distribution. 
Fig. 6.5 illustrates this: requiring a skewness within ±0.5 works reasonably well; more extreme 
values are hardly possible. For excess kurtosis, the expansion tends to overreact, and samples often 
have higher kurtosis than required. 
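
To see how closely the simulated moments track the requested ones, a sketch along the following lines can be used. The target values (`skew = 0.3`, `kurt = 3.5`), the seed, and the use of `erfcinv` in place of `norminv` (so that no toolbox is required) are assumptions made for this illustration; the sample moments are computed by hand for the same reason.

```matlab
% Sketch: Cornish-Fisher simulation and resulting sample moments.
rng(3);                                    % fixed seed (assumption)
N = 1e5;
mu = 0; sigma = 1; skew = 0.3; kurt = 3.5; % illustrative targets
u = -sqrt(2) * erfcinv(2*rand(N,1));       % inverse normal CDF via erfcinv
Omega = u + skew/6*(u.^2-1) + (kurt-3)/24*(u.^3 - 3*u) ...
          - skew^2/36*(2*u.^3 - 5*u);
X = mu + sigma * Omega;
d = X - mean(X);                           % centered sample
sampleSkew = mean(d.^3) / mean(d.^2)^1.5;  % sample skewness
sampleKurt = mean(d.^4) / mean(d.^2)^2;    % sample kurtosis (normal: 3)
fprintf('requested skew %.2f, sample %.3f\n', skew, sampleSkew);
fprintf('requested kurt %.2f, sample %.3f\n', kurt, sampleKurt);
```

Repeating the experiment with more extreme targets gives a feel for where the expansion stops being reliable.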


6.4.3 Further distributions 


For convenience, rand refers to a (0, 1) uniform random number generator and randn to the one 
for standard normals. Several important distributions can be traced back to the normal distribution 
or transformations of it. 

Lognormal distribution $X$ is lognormally distributed if $\log(X) \sim N(\mu, \sigma^2)$ is normally 
distributed. We will meet this distribution again on several occasions since it is a popular choice 
for stock prices. 

z = randn*sigma + mu; 
x = exp(z); 

$\chi^2$ distribution The sum of $\nu$ squared independent standard normal variables follows a chi-squared 
distribution with $\nu$ degrees of freedom; the usual way of writing is $\chi^2_\nu$. 

z = randn(1,nu); 
x = z * z'; 


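F distribution The ratio of two $\chi^2$ distributed variables, $x_i \sim \chi^2_{\nu_i}$, each divided by its degrees of 
freedom, is F distributed: $F_{\nu_1, \nu_2} = \frac{x_1/\nu_1}{x_2/\nu_2}$. 


z1 = randn(1,nu(1)); 
z2 = randn(1,nu(2)); 
x = (z1*z1'/nu(1)) / (z2*z2'/nu(2)); 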

Student $t$ distribution The $t$ distribution with $\nu$ degrees of freedom is defined as the distribution 
of the ratio $Z/\sqrt{V/\nu}$, where $Z \sim N(0, 1)$ and $V \sim \chi^2_\nu$; $Z$ and $V$ are independent. In this 
incarnation, it has expected value 0 (for $\nu > 1$) and variance $\nu/(\nu - 2)$ (for $\nu > 2$). This distribution 
is important in statistical testing, and it is sometimes used as an alternative to the normal distribution 
to model asset returns because of its heavier tails: the lower $\nu$, the higher the chances for extreme values 
compared with the normal distribution, and this is a feature that can also be found in empirical 
data. 

As mentioned earlier, $V$ could, in principle, be sampled as the sum of $\nu$ squared Gaussian 
variates. This requires sampling $\nu + 1$ standard Gaussian variates to compute one $t$ variate. 
A more efficient algorithm is given in Bailey (1994): 


u = rand(1,2); 
x = sqrt( nu * (u(1)^(-2/nu) - 1) ) * cos(2*pi*u(2)); 


Exponential The exponential distribution comes with only one parameter, $\lambda$, which is also its 
mean. It suffices to generate variates with unit mean (i.e., $\lambda = 1$), since the scaled variable, 
$mX$, will have mean $m$. 


x = -log(rand) ; 


Laplace A Laplace variate with $\mu = 0$ and $b = 1$ is an exponential one with $\lambda = 1$, plus a random 
sign (which is why the Laplace distribution is also called the double exponential distribution). 


u = rand(2,1); 
x = -log(u(1)) * sign(u(2) - 0.5); 


Cauchy The Cauchy distribution has a PDF of $f(X) = \frac{\sigma}{\pi(\sigma^2 + X^2)}$ and a CDF of $F(X) = 1/2 + 
1/\pi \arctan(X/\sigma)$. The inverse is $\sigma \tan(\pi(U - 1/2))$, but because the tangent is periodic with 
period $\pi$, variates can be simulated with $\sigma \tan(\pi U)$. 


u = rand; 
x = sigma * tan(pi*u); 


Poisson The Poisson distribution has probability mass function 

$$P(k) = \frac{e^{-\mu} \mu^k}{k!} \quad\text{for } k = 0, 1, 2, \ldots$$

For small $\mu$, we can use the following algorithm (Ripley, 1987, Alg. 3.3): 


P= 1; 
N = 0; 
c = exp(-mu) 


Poisson variates are often used to model the occurrence of independent, typically rare events 
(in finance: jumps). For recipes to generate jumps over time, see Devroye (1986, Chapter 6). 

Bernoulli Success ($x = 1$) has probability $p$, otherwise $x = 0$. 
x = rand < p; 


Binomial In a series of $n$ draws, each time the probability of success is $p$. The probability of $k$ 
successes is 

$$\binom{n}{k}\, p^k (1 - p)^{n-k}. \qquad (6.3)$$

The binomial distribution gives the distribution of successes, that is, Eq. (6.3) for $0 \le k \le n$. 
The simplest (but practical only for small $n$) idea is to generate $n$ Bernoulli variates, and to 
sum them. 


x = sum(rand(n,1) < p); 
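
Two of the recipes in this section can be sanity-checked with a few lines. The sketch below draws Student $t$ variates with the Bailey formula and Poisson variates with the multiplication-of-uniforms method. The Poisson fragment earlier in this section breaks off after its initialization, so the loop used here is an assumed continuation based on the standard form of that method, not a reproduction of the source; the parameter values, the seed, and the names `tBailey` and `pois` are likewise illustrative.

```matlab
% Sketch: sample moments for Bailey's t variates and Poisson variates.
rng(11);                                  % fixed seed (assumption)
nu = 5; mu = 2; n = 2e4;                  % illustrative parameters
tBailey = zeros(n,1);
pois    = zeros(n,1);
for i = 1:n
    % Student t via Bailey's polar formula
    u = rand(1,2);
    tBailey(i) = sqrt(nu*(u(1)^(-2/nu) - 1)) * cos(2*pi*u(2));
    % Poisson: multiply uniforms until the product drops below exp(-mu)
    % (assumed continuation of the truncated fragment)
    P = 1; N = 0; c = exp(-mu);
    while true
        P = P * rand;
        if P < c, break; end
        N = N + 1;
    end
    pois(i) = N;
end
fprintf('t:       mean %.3f, var %.3f (theory %.3f)\n', ...
        mean(tBailey), var(tBailey), nu/(nu-2));
fprintf('Poisson: mean %.3f, var %.3f (theory %.1f)\n', ...
        mean(pois), var(pois), mu);
```

For a Poisson variate, mean and variance both equal $\mu$; for the $t$ distribution with $\nu > 2$, the variance is $\nu/(\nu - 2)$.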


6.5 Sampling from a discrete set 


6.5.1 Discrete uniform selection 


In several of the nondeterministic methods covered in this book, one needs to randomly pick one or 
several elements out of a given set. Assume that the n elements of this set are indexed i = 1...n. 
If any element can be picked with the same probability of 1/n, this can be done in a simple fashion 
using uniform random numbers. Assume that n = 2, then the rule could be as follows: if a (0, 1) 
uniform random number u is within the range [0, 0.5), the first element is picked; if it is within 
the range [0.5, 1), the second element is chosen.⁹ Multiplying u by n increases the ranges to [0, 1) 
and [1, 2). The integer part of 2u indicates which of the elements is chosen: 0 for the first and 1 for 
the second. Hence, rounding 2u up to the next integer will return the index number of the element. 
This generalizes for arbitrary n > 1. In MATLAB, 


ceil(rand*n) 


will, therefore, produce a uniformly distributed discrete number from the set {1,2,..., n}. 
If more than one draw is required, one needs to distinguish whether one element may be selected 
several times. If so, 


u = ceil(rand(1,d)*n); 


will produce a 1 x d vector u containing the draws from the set {1,..., n}, which may contain 
identical values. If an element may be drawn only once, there are several recipes. One would be to 
draw one element after the next one and ensure that it hasn’t been picked previously. A MATLAB 
implementation is as follows: 


u = NaN(1,d); 
for i = 1:d 
    while isnan(u(i)) 
        test = ceil(rand*n); 
        if not(ismember(test,u(1:(i-1)))) 
            u(i) = test; 
        end 
    end 
end 


If d approaches n, this is probably not the most efficient way: for the later draws, only few valid 
options will be left, and an increasing number of invalid (and time-consuming) attempts are neces- 
sary to discover the remaining ones. In this case, a faster alternative is to shuffle the elements, place 
them in random order, and pick the first d elements: 


rndList = randperm(n) ; 
u = rndList(1:d); 


Here, the MATLAB function randperm(n) is used to produce a random permutation of the 
numbers 1,...,n. (For a homemade version of randperm, one can generate a vector of n uniform 
variates and use their ranks.) Alternatives will be presented in Section 6.5.3. 


9. By construction, most uniform random number generators report neither the upper limit, 1, nor the lower limit, 0. 


6.5.2 Roulette wheel selection 


The implicit assumption in Section 6.5.1 was that all elements are picked with the same probability. 
If deviations from uniformity are required, then a different approach must be chosen. Consider a 
simple roulette wheel. The chance that the ball falls into any of the slots will depend on the slot’s 
width: if one slot is wider, it will catch the ball more often. Assume the roulette wheel only had two 
slots, and one slot stretches over three-quarters of the wheel, the other one over just one-quarter. 
Then, the ball should have a 3/4 chance of landing in the first slot. Hence, if we standardize the 
circumference of the wheel with 1, then the first segment would stretch from 0 to 0.75, and the 
second one from 0.75 to 1. To simulate such a wheel, one could pick a (0, 1) uniform number u 
and assign values below 0.75 to the first slot, those above 0.75 to the second. This is equivalent, by 
the way, to the approach chosen for simulating Bernoulli variables (see Section 6.4.3): if the probability 
for success is p, then u < p will give true if u falls within [0, p) and false otherwise. 

This principle can be generalized to an arbitrary number of possible outcomes. Each of the 
outcomes is assigned one slot, and the slot’s width in proportion to the circumference reflects its 
probability of being chosen. The limits between the slots on a standardized roulette wheel (circum- 
ference = 1) are then at 0 and at the cumulative probabilities. For example, assume there are three 
alternatives, and they should be picked with chances 3 : 5 : 2. Then, the (cumulative) probabilities 
are 


Alternative                 1       2       3 
Probabilities               0.30    0.50    0.20 
Cumulative probabilities    0.30    0.80    1.00 


A (0, 1) uniform random number has a 30% chance to fall into the first slot, [0, 0.3), a 50% 
chance for the second slot, [0.3, 0.8), and a remaining 20% chance for the third slot, [0.8, 1). To 
make a draw, one can then pick a uniform random number, u, and determine the corresponding slot, 
that is, find the first slot where u does exceed the lower bound, but not the upper bound. 

The MATLAB code for roulette wheel selection is provided below. It accepts two arguments: the 
first is a vector of individual chances or propensities, the second (optional) is the number of draws. 


Listing 6.8: C-RandomNumberGeneration/M/./Ch06/RouletteWheel.m 


function w = RouletteWheel(prop,N) 
% RouletteWheel.m -- version 2011-01-06 
% roulette wheel selection 
% prop ... propensities for choosing an element ( > 0 ) 
% N ...... number of draws 
if nargin < 2, N = 1; end 

% -- compute (cumulative) probabilities 
prob = max(0,prop)/sum(max(prop,0)); 
cum_prob = cumsum(prob); 

% -- perform draws 
w = NaN(1,N); 
for i = 1:N 
    u = rand; 
    w(i) = find(u < cum_prob,1); 
end 
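
As a quick plausibility check, the roulette wheel logic can be applied to the 3 : 5 : 2 example and the observed frequencies compared with the intended probabilities. The sketch below repeats the selection step inline rather than calling the listing; the number of draws, the seed, and the name `freq` are assumptions made for illustration.

```matlab
% Sketch: empirical frequencies of roulette wheel selection for 3 : 5 : 2.
rng(5);                                  % fixed seed (assumption)
prop = [3 5 2];                          % propensities from the example
N = 1e5;                                 % number of draws
cum_prob = cumsum(prop) / sum(prop);     % slot limits: 0.3, 0.8, 1.0
w = NaN(1,N);
for i = 1:N
    w(i) = find(rand < cum_prob, 1);     % first slot whose upper limit exceeds u
end
freq = accumarray(w(:), 1)' / N;         % relative frequency of each alternative
fprintf('frequencies: %.3f %.3f %.3f\n', freq);
```

The frequencies should settle near 0.30, 0.50, and 0.20 as the number of draws grows.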
6.5.3 Random permutations and shuffling 


An integral part of some methods is to randomly (re-)arrange elements. What is a good method for 
this depends on how big the variations should be. If only minor variations in an existing sequence 
$\{x_i\}$ are required, one minor disturbance would be to randomly pick just two elements, $i_1$ and $i_2$, 
and swap them: 


i = ceil(rand(1,2)*length(x)); 
while i(2) == i(1) 
    i(2) = ceil(rand*length(x)); 
end 
x(i) = x(i([2 1])); 


Repeating this procedure adds disturbance; however, it is not the most efficient way of shuffling 
an entire sample. In that case, a random new order can be picked, and the elements arranged, 
accordingly. The MATLAB function sort (u) returns two vectors, the sorted elements and the 
corresponding positions in the original, unsorted vector. Hence, 


u = rand(1,n); 
[ignore,i] = sort(u); 

assigns i a random sequence of the numbers 1 to n. The MATLAB function randperm(n) does 
exactly this. 

In some circumstances, it might be helpful to move a subsequence of length s. A simple ap- 
proach for this is to randomly select a sequence, find a new insertion point, and rearrange the parts. 


Listing 6.9: C-RandomNumberGeneration/M/./Ch06/rearrange.m 


function x = rearrange(x,s) 
% rearrange.m -- version 2011-01-06 
% x ... data sample 
% s ... length of segment to be moved 

n = length(x); 
i_start = ceil(rand*(n-s+1)); 
i = i_start + (0:(s-1));            % elements to be moved 
t = ceil(rand*(length(x)-s+1))-1;   % target position 
chunk = x(i); 
x(i) = [];                          % remove chunk 
x = [x(1:t) chunk x(t+1:end)];      % insert at new position 


6.6 Sampling errors—and how to reduce them 
6.6.1 The basic problem 


Monte Carlo simulations heavily rely on the “law of large numbers”: the distribution of (increas- 
ingly) large samples should converge to the distribution of the underlying population. Hence, when 
tossing a coin, say, n = 1 million times, then heads actually should come up approximately half 
of the time. With smaller sample sizes, however, this does not necessarily hold: for only 10 tosses, 
there is a reasonable chance (11.7% to be exact) to see seven heads and only three tails. Using 
this particular sample would then give us a biased picture due to the sampling error. Unfortunately, 
even when samples appear large, this Monte Carlo error can still be noticeable, for example, when 
sophisticated models are used or the analysis only focuses on certain parts of the distribution like 
the tails (neither of which is unusual in finance—just think of risk models for demanding portfolios 
or assets). More importantly, the number of draws cannot always be increased sufficiently to limit 
this error to an acceptable level; CPU time, memory, and other computational limitations are at the 
top of the list of potential restrictions. 
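
The 11.7% figure follows directly from the binomial probability of seven successes in ten fair tosses. The short sketch below computes it and, as an illustration of the sampling error, simulates the share of heads for a few sample sizes; the sample sizes and the seed are assumptions.

```matlab
% Sketch: exact probability of 7 heads in 10 tosses, and simulated shares.
p7 = nchoosek(10,7) / 2^10;              % 120/1024
fprintf('P(7 heads in 10 tosses) = %.4f\n', p7);   % prints 0.1172
rng(2);                                  % fixed seed (assumption)
for n = [10 1e3 1e6]
    share = mean(rand(n,1) < 0.5);       % share of heads in n simulated tosses
    fprintf('n = %7d: share of heads = %.4f\n', n, share);
end
```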

There are different approaches to overcome this issue. They all have in common that they inter- 
fere with the way samples are drawn. In particular, they abandon the goal of producing (seemingly) 
independent draws. The following subsections discuss some of the more popular methods. 
6.6.2 Quasi-Monte Carlo 


General considerations ... 


Let us revisit the coin-flipping example for a minute. If we use the outcomes head (H) and tails (T) 
to represent a “yes/no” decision occurring with equal odds, the ideal outcome for n draws would 
be n/2 for either case. If one is only interested in frequencies and the sequence is irrelevant, a 
constructed solution for n = 6 draws that is “perfectly” distributed would simply assign one half the 
draws to “H” and the other half to “T.” This “tailoring samples to fit the distribution” is, in essence, 
what quasi-Monte Carlo (QMC) does. Hence, a QMC solution here would be [H H H T T T], and 
another one [H T H T H T]. The latter has the advantage that discarding the last two observations 
would immediately provide the solution for n = 4 draws while the former wouldn’t. In either case, 
serial patterns emerge that are highly unlikely in genuine random numbers. 

Constructing a solution for continuous distributions is also feasible. Consider the (0, 1) uniform 
distribution. To evaluate the uniformness of the points, a discrepancy measure can be introduced 
that checks whether all observations are positioned equidistantly. When sorting the draws, $x_{(i)} \le 
x_{(i+1)} \;\forall i = 1, \ldots, n-1$, uniform numbers should be equidistant; $x_{(i+1)} = x_{(i)} + \delta \;\forall i = 1, \ldots, n-1$. 
The tricky bit here is how to deal with the draws closest to the (lower and upper) limits, that 
is, $x_{(1)}$ and $x_{(n)}$. A wrap-around discrepancy measure would favor a solution where the distance to 
the limits is $\delta/2$; that is, $x_{(1)} = 0 + \delta/2$, $x_{(2)} = x_{(1)} + \delta = 3\delta/2$, and generally $x_{(i)} = x_{(i-1)} + \delta = 
(2i - 1)\delta/2$. 

To make this more tangible, if $n = 1$, then the only point will be ideally positioned at $x_{(1),n=1} = 
0.5$. If $n = 2$, the distances between points ought to be $\delta_{n=2} = 0.5$, and the positions are $x_{(1),n=2} = 
0.25$ and $x_{(2),n=2} = 0.75$; for $n = 3$: $\{1/6, 3/6, 5/6\}_{n=3}$; for $n = 10$: $\{0.05, 0.15, 0.25, \ldots, 0.95\}_{n=10}$; 
and so on. For all of these cases, the expected value of the sequence is exactly 0.5; the variance, 
however, only converges to its theoretical value of $1/12$ if $n$ is large. Therefore, alternative 
construction methods position the points such that higher moments also agree with their theoretical values 
as well as possible. For a comprehensive discussion, see Niederreiter (1992). 
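
The construction above is easy to reproduce. The following sketch builds the equidistant points $x_{(i)} = (2i - 1)\delta/2$ with $\delta = 1/n$ for several $n$ and reports their mean and variance; the chosen values of $n$ are illustrative.

```matlab
% Sketch: equidistant "perfectly uniform" points and their first two moments.
for n = [1 2 3 10 1000]
    x = (2*(1:n) - 1) / (2*n);            % x_(i) = (2i-1)*delta/2, delta = 1/n
    v = mean((x - mean(x)).^2);           % variance of the point set
    fprintf('n = %4d: mean = %.4f, variance = %.6f (1/12 = %.6f)\n', ...
            n, mean(x), v, 1/12);
end
```

For these point sets the variance works out to $(n^2 - 1)/(12 n^2)$, which makes the slow approach to $1/12$ explicit: it is 0 for $n = 1$, 0.0625 for $n = 2$, and 0.0825 for $n = 10$.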


...and caveats 


When looking at these examples more closely, one can already spot some limitations of these 
methods for financial applications. 


• The optimal solution for n + 1 is usually not found by taking the solution for n and adding 
one extra point. Likewise, choosing the values for high n but using only some of them (e.g., 
because one runs out of time) will lead to a (systematic) bias. Hence, the designer has to decide 
in advance how many samples he or she wants to evaluate. 

• Determining the draws for the multivariate case can become a tough optimization problem in its 
own right—even (and in particular) when the draws must be orthogonal (i.e., uncorrelated) in all 
dimensions. Fang et al. (2006) provide bounds and show how heuristics can help in tackling this 
problem. 

• Repeated experiments generate exactly the same sequence. On the one hand, this is good as it 
simplifies replicability. On the other hand, this is not so good because if an interesting spot in the 
range is missed once, then repeated experiments will not cover it either. Even more importantly, 
it is difficult to judge how stable the results are: Results with pseudorandom numbers will vary 
from experiment to experiment, but should converge with increasing sample size. If results are 
close together, this could indicate stability; if they are all over the place, then the results are 
obviously not very robust. With quasi-Monte Carlo numbers, the results will be identical unless 
one changes n; and unless n changes dramatically (in particular when n is already high), the 
variations in the points could be modest. In finance, one is often interested in extreme risks and 
rare events, and generating draws with QMC might lead to biased and unreliable results. 

• In financial applications, it is often not only the distribution of the numbers that matters, but 
also its sequence. Assume n = 100 uniform QMC numbers are used to generate returns. If the 
draws are constructed as in the example above, they will be sorted, and so will the returns. This 
might be fine if one is interested in a simulation of stock prices at one particular point in time; 
for example, the distribution at the maturity date of an option. However, if one is interested in a 
sample path over 100 days, it is a different story. To avoid undesirable sequential patterns, other 
construction methods have to be used. In short, not every QMC approach is equally suitable for 
all purposes. 


Despite all these limitations, quasi-Monte Carlo can be a very useful and efficient tool for 
financial applications, provided it is handled with care and one is aware of its limitations. Section 8.2.3 
will discuss this in the context of option pricing. 


6.6.3 Stratified sampling 


As mentioned earlier, Monte Carlo experiments with small samples can have the disadvantage that 
some important areas in the probability space could be left out. At the same time, there could be 
clusters of draws so that other areas are overrepresented. If quasi-Monte Carlo is inappropriate or 
too difficult to apply, then stratified sampling can be used to reduce the sampling error. The idea is 
to split the probability space into B segments of equal probability and draw N/B random samples 
from each of these segments. This ensures that the draws do not have too obvious clusters, while 
still preserving some of the Monte Carlo properties. If B = 1, then this is (usually) equivalent to 
“plain vanilla” Monte Carlo experiments; increasing B → N brings it closer to a quasi-Monte Carlo 
experiment. 

There are many different versions to translate this general concept into an actual application. 
The first step always is to fix the number of segments or bins, B. Again, for the sake of simplicity, 
assume we want to draw N = 100 samples from a (0, 1) uniform distribution. If one chooses B = 
5, then the first bin would be [0, 0.2), the next one [0.2, 0.4), etc. The next steps can vary; the 
outcome, however, should be similar. To illustrate how to find solutions, consider the following 
three examples. 


• In a rather systematic fashion, one could draw one number from each bin and repeat this exercise 
N/B = 100/5 = 20 times. Here is a simple MATLAB implementation: 


B = 5; 

N = 100; 

bin = repmat(1:B,1,ceil(N/B)); 
u = rand(1,N)/B + (bin-1) /B; 


At this stage, the numbers will be piecewise sorted. Plotting the vector u will illustrate this. If 
sequence does matter, these numbers can be rearranged randomly or systematically; see Section 6.5.3. 
Alternatively, one can permute the vector bin that contains the bin for each draw. The modified 
code then is: 


B = 5; 
N = 100; 
bin = repmat(1:B,1,ceil(N/B)); 
bin = bin(randperm(length(bin))); 
u = rand(1,N)/B + (bin-1)/B; 


• Another idea would be to draw N samples and check which bins are overrepresented and which 
are underrepresented. Excess observations from the former will be discarded and replaced with 
additional draws from the latter bins. Choosing which of the excess observations to keep and 
which ones to discard could be a deterministic or random decision. 

e A third approach would be to use the roulette wheel principle and draw a sequence of N numbers. 
Initially, each bin should have the same probability of being chosen; over time, however, those 
bins with more elements in them should have a lower propensity of being chosen. In other 
words, the propensity should reflect the number of still available “free” slots in a bin. If each 
bin b=1,..., B eventually contains at most N pr elements, then the number of free slots is 
N, a — Np, where N, is the current number of elements in this bin. If all bins must eventually 
have the same number of elements, then M a = N/B and the versions above would be more 
efficient implementations. However, if one allows for some variations and capacities (i.e., there 
is an upper limit on observations from one cluster and hence on clusters), then this could be 
reasonable. Likewise, one could design versions that ensure that a minimum amount of draws 
does come from each bin; the “cookbook recipe” then could be to first draw the necessary number 
of samples for each bin, and then fill it up with arbitrary draws; add permutation if necessary. 
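
The roulette wheel variant described in the third item can be sketched as follows. This is only an illustration under stated assumptions: the capacity vector Nmax (25 per bin, so that the total capacity exceeds N) and all variable names other than B, N, bin, and u are hypothetical choices, not part of the original text.

```matlab
B    = 5;                  % number of bins
N    = 100;                % number of draws
Nmax = 25 * ones(1,B);     % assumed capacity per bin; sum(Nmax) must be >= N
Nb   = zeros(1,B);         % current number of elements per bin
bin  = zeros(1,N);         % bin chosen for each draw

for i = 1:N
    free = Nmax - Nb;                  % free slots per bin
    cp   = cumsum(free) / sum(free);   % cumulative selection probabilities
    b    = find(rand < cp, 1);         % roulette wheel: first bin whose cp exceeds the draw
    bin(i) = b;
    Nb(b)  = Nb(b) + 1;                % a full bin gets probability zero from now on
end

u = rand(1,N)/B + (bin-1)/B;           % uniform draw within the chosen bin
```

Because the selection probability is proportional to the number of free slots, no bin can exceed its capacity, yet the bin counts are not forced to be identical.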
6.6.4 Variance reduction


Monte Carlo methods are based on the law of large numbers. Basically, it states that suitable sam-
ples should have properties similar to those of the population, and the larger the sample, the greater
this similarity. Sometimes, dissimilarities are by design: a finite number of draws from a continuous
space will never be exactly continuous, and, in particular, rare events will not be represented per- 
fectly. More importantly, Monte Carlo methods use random variables—which makes the outcome a 
random sample; and the statistical signatures will also be random. Ideally, the reported statistics of 
the samples have little variance and are centered around the population’s corresponding values. To 
decrease variance and “randomness” of results by experimental design, increasing the sample size 
helps (law of large numbers!) but is computationally costly. Alternatively, one can use a sampling 
process that favors fast variance reduction. 

Quasi-Monte Carlo is targeting this issue for small samples by constructing variates that mini- 
mize the dissimilarity; they do come with other limitations, though (see Sections 6.6.2 and 9.3.4). 
Another approach would be to gear the generation of (pseudo)random numbers in a way that re- 
duces the variance of the results. 


Antithetic variables 


To assess the magnitude of the sampling error, the central limit theorem (CLT) can help. Under
certain (usually met) conditions, the mean of i.i.d. samples is approximately normally distributed,
$\bar{x} \sim N(\mu_x, \sigma_x^2/n)$, where n is the sample size and $\mu_x$ and $\sigma_x^2$ are the mean and variance of the
random variable X. The reason for this becomes obvious when looking at the definitions:

$$\operatorname{Var}(\bar{x}) = \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = \frac{1}{n}\operatorname{Var}(x_i) + \frac{n(n-1)}{n^2}\operatorname{Cov}(x_i, x_j)$$

if all samples $x_i$ have the same variance and all pairs $(x_i, x_j)$ have the same covariance. In other
words, the sample mean will have a standard deviation of $\sigma_x/\sqrt{n}$. So, to reduce the sample mean’s
standard deviation by half when samples are uncorrelated, one needs four times as many indepen-
dent samples; to reduce it by one order of magnitude (i.e., by a factor 10), n must be increased by
a factor 100.

The decrease in the sample mean’s variance could be accelerated if the samples had negative
covariance; the most extreme case here would be a perfect negative correlation. Variance reduction
with antithetic variables does exactly that: draw (or construct) samples that are perfectly negatively
correlated. Assume $x_i \sim N(0, 1)$. Then, $x_j = -x_i$ would be the perfect antithetic variable,¹⁰ and
the pair $(x_i, x_j)$ would have exactly the desired mean of the underlying distribution: $\bar{x} = \mu = 0$.
For (0, 1) uniform variables, pairs $(u_i, u_j)$, where $u_j = 1 - u_i$, would be the equivalent. These
antithetic uniforms can then be used to generate samples for other distributions, for example, by
using the inversion method.

The problem with this approach is that it is suitable only for certain parameters and statistics: 
the sample mean might converge very fast, but, for example, variance could be misrepresented. 
Therefore, antithetic variables are less useful in models where variance and/or higher moments are 
at least as important as the mean. Just think of pricing an out-of-the-money option: getting the 
expected price of the underlier correct is not sufficient to price derivatives. 
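
A minimal sketch of antithetic uniforms combined with the inversion method may make both the benefit and the limitation concrete. It assumes standard normal targets and uses the base-MATLAB identity norminv(u) = sqrt(2)*erfinv(2*u-1) to avoid toolbox functions; the variable names and the test statistic exp(z) are illustrative choices only.

```matlab
n  = 1000;                          % total sample size (even)
u  = rand(n/2,1);                   % ordinary uniforms
ua = 1 - u;                         % antithetic uniforms
z  = sqrt(2) * erfinv(2*[u; ua] - 1);   % inversion method for N(0,1)

m  = mean(z);                       % zero up to rounding, by construction
v  = var(z);                        % not pinned down by the pairing; still random

% a nonlinear statistic, E[exp(z)] = exp(0.5), is helped less by the pairing
est_anti  = mean(exp(z));
est_plain = mean(exp(randn(n,1)));  % plain Monte Carlo for comparison
```

Repeating the experiment shows that m is always (numerically) zero, whereas v and est_anti still vary from run to run.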


Importance sampling 


Rare and extreme events that have a great impact on the overall outcome are a major issue in
financial simulations. By definition, they should not occur too often, so in particular in small samples,
one cannot really get them right: if they occur, they might be overrepresented; if they don’t, they get
completely ignored. Variance reduction (e.g., by antithetic variables) isn’t a solution; just think of
a call option at maturity: extreme positive events in the underlying’s returns have a massive impact
on the call when it is in-the-money, whereas differences in the prices below the strike price will
make no difference, regardless of how extreme they are.


10. Note that for $x_i \sim N(\mu, \sigma^2)$, $x_j = 2\mu - x_i$.
The idea of importance sampling is to prioritize sensitive regions and draw more samples there.
The samples are then weighted appropriately to correct overrepresentation or underrepresentation.
Intuitively, the weight should be the inverse of the factor of overrepresentation.

To illustrate this principle, assume S is lognormally distributed with $\log(S) \sim N(\mu, \sigma^2)$ and
one is particularly concerned with extreme positive values. In that case, one could draw, say, three-
quarters of all samples from the upper 10% quantile. In other words, $w \cdot n = 0.25 \cdot n$ of the samples
should fall into the lower quantile, $q = F(s_i) < 0.9$.


% --setting
% (mu and sigma, the mean and standard deviation of log(S), must be set beforehand)
n = 1000; % number of samples
w = 0.25; % fraction of samples in lower quantile
q = 0.90; % lower quantile ends at

% --draw uniforms according to importance
u1 = rand(n*w,1) * q;              % [0..q]
u2 = rand(n*(1-w),1) * (1-q) + q;  % [q..1]

% --normal and log-normal variates, e.g., inversion method
s1 = norminv(u1,mu,sigma); S1 = exp(s1);
s2 = norminv(u2,mu,sigma); S2 = exp(s2);

% --compute mean
mean_S = ( sum(S1) * q/w + sum(S2) * (1-q)/(1-w) ) / n;

% --alternatively:
mean_S = mean(S1)*q + mean(S2)*(1-q);


The practical application of importance sampling can be hampered, though, when these samples 
are used as an ingredient for more sophisticated models. For a “plain vanilla” European option, the 
important areas of the underlier’s distribution and the corresponding weights are easy to find; this 
might not be true for structured products. 
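
The weights used in the mean computation above can be checked by hand. The following worked equation is a sketch in the notation of the listing; the symbols $S_{1,i}$, $S_{2,i}$, $\bar{S}_1$, and $\bar{S}_2$ for the individual draws and their region means are introduced here for illustration only.

```latex
\begin{align*}
\widehat{E}(S)
  &= \frac{1}{n}\left(\frac{q}{w}\sum_{i=1}^{wn} S_{1,i}
     + \frac{1-q}{1-w}\sum_{i=1}^{(1-w)n} S_{2,i}\right)
   = q\,\bar{S}_1 + (1-q)\,\bar{S}_2 ,\\[4pt]
\text{with } q = 0.9,\ w = 0.25:\quad
  \frac{q}{w} &= 3.6, \qquad \frac{1-q}{1-w} = \frac{0.1}{0.75} \approx 0.133 .
\end{align*}
```

The upper region carries 10% of the probability mass but receives 75% of the draws, so it is overrepresented by a factor 7.5 and each draw is weighted by 1/7.5 ≈ 0.133; the lower region is underrepresented by a factor 0.25/0.9 ≈ 0.278 and each draw is weighted by 3.6.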


6.7 Drawing from empirical distributions 
6.7.1 Data randomization 


Drawing from a given sample, that is, resampling, is an important issue in finance. For simulation 
purposes, empirical data can be used either directly (e.g., in bootstraps) or as basis for the calibra- 
tion of parametric models (e.g., parametric bootstraps). At the same time, one ought to check how 
“typical” or unusual a data set is. There exists a variety of alternatives; some of them can be found 
in Gentle (2009, Chapter 12). 

At this point, we only want to mention the jackknife, where a small number of observations 
are excluded from the sample. If the statistics of the remaining observations change noticeably, 
then the excluded observations are considered somewhat unusual. For financial data, jackknifing 
might also reveal something about robustness of results. Taking the daily returns from a year and 
ignoring the one or two best or worst days gives some indications what would have happened to a 
buy-and-hold investor who was (un)fortunate enough to be out of the market only on these days. 
Though rare, these extreme events tend to occur regularly and can have a substantial impact on the 
overall results. Fig. 1.2 in Section 1.4 illustrates this problem; Fig. 16.16 on page 526 provides a 
drastic example of how data errors can aggravate it; and Section 14.2.4 investigates the differences 
among true, estimated, and realized return properties of portfolios. At this stage, it is reasonable to 
assume that the available data are clean and representative. 
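
The buy-and-hold thought experiment can be sketched in a few lines. The return vector below is made up purely for illustration (it is not real market data), and all variable names are hypothetical.

```matlab
% hypothetical daily returns; in practice, one year of observed returns
r = [0.012 -0.008 0.031 -0.027 0.004 0.009 -0.015 0.022 -0.003 0.006]';

total_all = prod(1 + r) - 1;           % buy-and-hold return over all days

[~, idx] = sort(r, 'descend');         % rank days from best to worst
keep = true(size(r));
keep(idx(1:2)) = false;                % investor misses the two best days
total_wo_best = prod(1 + r(keep)) - 1;

keep = true(size(r));
keep(idx(end-1:end)) = false;          % investor misses the two worst days
total_wo_worst = prod(1 + r(keep)) - 1;
```

Comparing total_all with total_wo_best and total_wo_worst gives a first impression of how strongly the overall result depends on a handful of observations.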


6.7.2 Bootstrap 


Basic concepts 


A common problem in empirical work is that the true data-generating processes and underlying 
distributions are not observable, and only samples of observations are visible. These observations 
can be used to test hypotheses about assumed underlying models; they can also be used to calibrate 
these models. In other situations, it is opportune to use these data directly for simulations. In this 
case, one effectively samples from a sample. The basic principle here is to treat the available sample
as if it was, more or less, the underlying population. 

Obviously, there are some limitations and simplifying assumptions to it. The sample will almost
never be a perfect fit for the underlying distribution. The sample will be finite and
discrete even if the underlying process is continuous. Nonetheless, the empirical distribution of a
sample, $F_n$, is often the closest one can get to the actual underlying distribution F, and ideally, the
estimator used on $F_n$ produces the same results as if used on F itself. Given the focus of this book,
a discussion of the underlying statistical theory is omitted; a good starting point is Gentle (2009,
Chapter 13).

In finance, a popular application for bootstrap methods is the generation of returns. The simplest
example here is the univariate case of assumed i.i.d. realizations of a single asset. Assume that a
vector $\{r_t\}_{t=1}^{T}$ contains the available samples of the i.i.d. returns for a single period. Resampling n
observations into a new vector $\{x_i\}_{i=1}^{n}$ can then be done by

j = ceil(rand(n,1)*T);
x = r(j);

Resampling should leave the basic statistics intact, and the bootstrapped samples x will have (ap-
proximately) the same mean, variance, and higher moments as the initial sample r.

Extending this to the multivariate case is straightforward. If $\{r\}_{T \times k}$ is a matrix where each
column is for one asset and each row represents one observation of simultaneous returns of these k
assets, then the code reads as follows:


j = ceil(rand(n,1)*T); 

x = r(j,:); 

In this case, the correlations, dependencies, and co-moments¹¹ across assets should also remain
in place. This approach can, therefore, be used when simulating n realizations for a single-period 
development of an entire portfolio. 

When more than one period is of interest, things become slightly more challenging. In the ab- 
sence of serial dependence within the data, the new sample {x} can also be regarded as a series 
of subsequent i.i.d. returns. For empirical data, however, the assumption of independent returns 
often breaks down. Maintaining the time-series properties of returns is then important. If there is 
no (parametric) way to cater for this serial dependence, then a block bootstrap method can preserve 
at least some of the original properties. Rather than single data points, one picks blocks of b sub- 
sequent observations from the original sample {r}. For the multivariate case, one generic approach 
would be as follows: 

for i = 1:b:(n-b+1)
    j = ceil(rand*(T-(b-1)));           % random starting point in 1..T-b+1
    x(i:(i+b-1),:) = r(j+(0:(b-1)),:);  % copy a block of b subsequent rows
end
This, however, causes the first and Tth observations to be picked only if j is 1 and T − b + 1,
respectively, while an observation i somewhere in the center of the sample will be included
in any block with starting point j = (i − b + 1), ..., i. In particular, leaving out the most recent
observations is not desirable for financial simulations, as they might be quite typical of what to
expect next. Under the assumption of stationarity, the circular block bootstrap solves this problem
by filling the (T + d)th position with observation d. The modulo function can be used for this:


j = mod(j-1,T) + 1; 


The complete MATLAB function for bootstrapping is provided here: 


11. Unlike the covariance, coskewness and cokurtosis are rarely used. If data follow a joint normal distribution, they contain 
no information, whereas they can contain interesting information for nonsymmetric joint distributions. They are, however, 
numerically challenging and costly to compute. For details, see, for example, Konno and Suzuki (1995), Maringer (2008b), 
Maringer and Parpas (2009). Section 8.7.2 provides an example where higher order moments actually are an issue in real 
data. 
Listing 6.10: C-RandomNumberGeneration/M/./Ch06/bootstrap.m

function x = bootstrap(r,n,b)
% bootstrap.m -- version 2011-01-06
% r ... original data
% n ... length of bootstrap sample x
% b ... block length

if nargin < 3, b = 1; end

[T k] = size(r);
if b == 1 % simple bootstrap
    j = ceil(rand(n,1)*T);
    x = r(j,:);
else % circular block bootstrap
    nb = ceil(n/b); % number of bootstraps
    js = floor(rand(nb,1)*T); % starting points - 1
    x = nan(nb*b,k);
    for i = 1:nb
        j = mod(js(i)+(1:b), T)+1; % positions in original data
        s = (1:b) + (i-1)*b;
        x(s,:) = r(j,:);
    end
    if (nb*b) > n % correct length if nb*b > n
        x((n+1):end,:) = [];
    end
end



There is no universal solution for how to choose the block length b. Low values will break up 
the time structure, whereas high values will produce mostly identical resamples. Also, if there are 
very strong temporal patterns in the initial sample, cutting out chunks and arbitrarily stacking them 
together again can lead to unrealistically abrupt transitions. Bootstrap methods should, therefore, 
be used with caution. 
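
To see what the circular block bootstrap preserves, the following sketch resamples a toy series whose “returns” are simply the observation indices 1, ..., T. The values of T, n, and b are arbitrary illustrative choices, and the index arithmetic is a simplified variant of the one in the listing above, not a replacement for it.

```matlab
T  = 10;  r = (1:T)';         % toy data: each observation equals its position
n  = 9;   b = 3;              % sample length and block length (n a multiple of b here)
nb = ceil(n/b);               % number of blocks
js = floor(rand(nb,1)*T);     % random starting points minus 1, in 0..T-1

x = nan(nb*b,1);
for i = 1:nb
    j = mod(js(i) + (0:b-1), T) + 1;   % wrap around the end of the sample
    x((1:b) + (i-1)*b) = r(j);
end
x = x(1:n);

% within each block, subsequent observations are neighbors in the original data
d = mod(diff(reshape(x, b, [])), T);   % steps within blocks, modulo T
```

All entries of d equal 1: the ordering inside a block is intact (including wrap-arounds from T to 1), while the transitions between blocks are arbitrary.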


Parametric and nonparametric bootstraps 


By definition, the bootstrap method samples from a given data set. Problems can arise when the 
original data set either is small relative to the number of bootstrap samples or comes with rare 
events. The samples will be too similar and not necessarily represent the entire distribution. This 
can lead to bizarre behavior in the results and to overfitting,¹² and bigger sample sizes will only
increase the computing time but not the quality of the simulation. 

If one has a good idea of the data-generating process, he or she can try to fit a parametric model 
or distribution on the data and then sample from it. If, for example, one assumes that a stock price 
is lognormally distributed, all users need to do is to estimate the mean and standard deviation of the 
(normally) distributed log returns, sample corresponding normal variates, and transform them into 
prices. Chapter 8 demonstrates this in more detail. This “parametric bootstrap” reaches its limits, 
however, when not all crucial properties of the data are captured by the model—or if there is no 
obvious choice of model. 
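
The lognormal example just described can be sketched as follows. The short price history S is invented for illustration, and the variable names are hypothetical; the normal model for log returns is the assumption named in the text.

```matlab
% hypothetical price history; in practice, observed prices
S = [100 101 99.5 102 103.2 101.8 104 105.1]';

r = diff(log(S));               % log returns
m = mean(r);                    % estimated mean of log returns
s = std(r);                     % estimated standard deviation of log returns

n  = 250;                       % number of simulated periods
rs = m + s * randn(n,1);        % draw log returns from the fitted normal model
Ss = S(end) * exp(cumsum(rs));  % transform into a simulated price path
```

Unlike a plain bootstrap, the simulated returns rs are not restricted to the seven observed values, but they inherit every shortcoming of the fitted model.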

One “quick and dirty” fix would then be to jitter the data by adding some small noise, $\varepsilon$, to each
sample. The usual suspect for this is Gaussian noise with standard deviation $\sigma$,


xs = x(j) + randn(size(j)) * sigma 


This brings us to nonparametric methods. Assume one wants to generate random draws based on 
a finite sample of observations {x;} with unknown underlying density. Kernel density estimators can 
then be used to create a smoothed empirical probability density function (EPDF). For the univariate 
case, this can be done with 


$$\hat{f}_h(x) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$


12. See, for example, Maringer (2005a). 
TABLE 6.2 Popular kernels, K(y), for kernel density estimation.

Type           K(y)                                  Range
Uniform        $\frac{1}{2}$                         −1 < y < 1; 0 otherwise
Triangular     $1 - |y|$                             −1 ≤ y ≤ 1; 0 otherwise
Epanechnikov   $\frac{3}{4}(1 - y^2)$                −1 < y < 1; 0 otherwise
Gaussian       $\frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}$  all y

[Example column: plots of sample fits for each kernel]


where K(·) is the kernel, and h is the smoothing (or bandwidth) parameter. Early suggestions for
kernels include the uniform one, producing a shifted histogram (see Rosenblatt, 1956). Meanwhile,
the Gaussian, the triangular, and the quadratic (or Epanechnikov) kernels have become very popular
choices. For normally distributed data, Silverman (1986) shows that a bandwidth of $h = s\,(4/(3n))^{1/5}$ is
optimal for the Gaussian kernel, where s is the standard deviation of the sample. Table 6.2 provides
the definitions of different kernels and sample fits. Once the EPDF has been estimated, samples can
be drawn using, for example, the acceptance-rejection method (see Section 6.3.2).


Listing 6.11: C-RandomNumberGeneration/M/./Ch06/KDE.m

function f_x = KDE(x,xi,h,Kernel)
% KDE.m -- version 2011-01-06
% kernel density estimation
% x ... points at which to estimate the density
% xi .. original sample
% h ... bandwidth, smoothing parameter
% Kernel .. name of kernel

if nargin < 4 % default function
    Kernel = 'Gaussian';
    if nargin < 3 % default value for h
        h = (4/(3*length(xi)))^.2 * std(xi);
    end
end

switch upper(Kernel)
    case 'GAUSSIAN',
        K = @(y) exp(-y.^2/2) / sqrt(2*pi);
    case 'UNIFORM',
        K = @(y) (abs(y) < 1)/2;
    case 'TRIANGULAR',
        K = @(y) max(0, (1-abs(y)));
    case {'QUADRATIC','EPANECHNIKOV'}
        K = @(y) max(0, 0.75*(1-y.^2));
    otherwise
        error('Kernel not recognised')
end

f_x = zeros(size(x));
n = length(xi);
for j = 1:length(x)
    y = (x(j) - xi(:))/h;
    f_x(j) = ones(1,n) * K(y) / (h*n);
end



The choice of the smoothness parameter h can have a substantial impact. Again, consider the 
case of daily stock returns with one extreme event. If h is too low, $\hat{f}_h$ will have a narrow spike at this
event; if h is too large, distinct features of the samples become smoothed out; Fig. 6.6 illustrates this. 
Also, for multivariate data, the increasing number of dimensions can quickly increase the compu- 
tational costs. 


FIGURE 6.6 Gaussian kernel density estimates with smoothing parameters h = 0.025, 0.1, and 0.5, respectively (darker
lines for larger h).
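
As a quick plausibility check of a kernel density estimate, one can verify that it is nonnegative and integrates to (approximately) one. The sketch below does this for the Gaussian kernel with the rule-of-thumb bandwidth; the sample xi, the evaluation grid, and the use of trapz for numerical integration are illustrative assumptions.

```matlab
xi = [-1.2 -0.4 0.1 0.3 1.5]';       % hypothetical sample
n  = length(xi);
h  = (4/(3*n))^0.2 * std(xi);        % rule-of-thumb bandwidth for the Gaussian kernel

x = linspace(-8, 8, 1601);           % evaluation grid, wide enough to cover the tails
f = zeros(size(x));
for j = 1:length(x)
    y    = (x(j) - xi) / h;
    f(j) = sum(exp(-y.^2/2)) / (sqrt(2*pi) * h * n);   % Gaussian kernel estimate
end

area = trapz(x, f);                  % numerical integral of the estimated density
```

Rerunning the sketch with a much smaller or larger h leaves area essentially unchanged but alters the shape of f considerably, which is the effect described in the text.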


Taylor and Thompson (1986) suggest an alternative nonparametric approach that also takes “lo-
cal” information into account but requires even fewer assumptions about the underlying density. Their
algorithm randomly picks one observation $x_j$ and its nearest neighbors (e.g., based on Euclidean
distances), and a sample is created within their proximity. Intuitively speaking, if the neighbors are
close, the new sample will be close to their center, whereas if they are dispersed over a large area,
the new sample will also have higher variance. Algorithm 21 provides some additional technical
details.


Algorithm 21 Taylor-Thompson algorithm.

1: randomly pick one observation $x_j$
2: find the m − 1 observations closest to $x_j$: $x_{j_2}, \ldots, x_{j_m}$ (with $x_{j_1} = x_j$)
3: compute the mean $\bar{x}$ of the m observations $x_{j_i}$, i = 1, ..., m
4: generate random uniform samples $u_i \sim U\left(\frac{1}{m} - \sqrt{3(m-1)/m^2},\ \frac{1}{m} + \sqrt{3(m-1)/m^2}\right)$
5: compute the linear combination $e = \sum_{i=1}^{m} u_i\,(x_{j_i} - \bar{x})$
6: compute simulated value $x_s = \bar{x} + e$


Note that the weights $u_i$ have expected value 1/m, while their variance, $(m-1)/m^2$, is chosen
to outbalance the loss of variance in the samples by averaging. Fig. 6.7 illustrates the effect of m:
the fewer of the available observations are used, the more the new samples seem to cluster around
or directly between the existing points; increasing m will produce samples from a wider area.
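
The moments of the weights follow directly from the limits of the uniform distribution in step 4 of Algorithm 21. The short derivation below is a sketch for the univariate case with independent weights; conditioning on the selected neighbors is made explicit as an assumption.

```latex
\begin{align*}
E(u_i) &= \frac{1}{m}, \qquad
\operatorname{Var}(u_i) = \frac{(b-a)^2}{12}
  = \frac{\bigl(2\sqrt{3(m-1)/m^2}\bigr)^2}{12} = \frac{m-1}{m^2},\\[4pt]
E(e \mid x_{j_1},\ldots,x_{j_m}) &= \frac{1}{m}\sum_{i=1}^{m}(x_{j_i}-\bar{x}) = 0,\\[4pt]
\operatorname{Var}(e \mid x_{j_1},\ldots,x_{j_m}) &= \frac{m-1}{m^2}\sum_{i=1}^{m}(x_{j_i}-\bar{x})^2 .
\end{align*}
```

Given the neighbors, the simulated value is therefore centered on their mean, and its spread grows with the dispersion of the neighborhood.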


Listing 6.12: C-RandomNumberGeneration/M/./Ch06/TaylorThompson.m

function Xs = TaylorThompson(X,N,m)
% TaylorThompson.m -- version 2011-01-06
% X ... original sample
% N ... number of new samples to be drawn
% m ... number of neighbors to be used

[xr xc] = size(X);
% -- compute Euclidean distances
B = X * X';
ED = sqrt(repmat(diag(B),1,xr) + repmat(diag(B)',xr,1) - 2*B);

% -- limits for weights
m = min(xr,m);
uLim = 1/m + [-1 1] * sqrt(3*(m-1)/m.^2);

% -- draw samples
Xs = zeros(N,xc);
for s = 1:N
    j = ceil(rand*xr);
    % -- m nearest neighbors:
    [dnn inn] = sort(ED(:,j));
    Xnn = X(inn(1:m),:);
    % -- weights
    u = rand(1,m) * (uLim(2)-uLim(1)) + uLim(1);
    % -- form linear combinations
    Xnn_bar = ones(1,m) * Xnn / m;
    e = u * (Xnn - repmat(Xnn_bar,m,1));
    Xs(s,:) = Xnn_bar + e;
end

FIGURE 6.7 Samples generated with the Taylor-Thompson algorithm; original observations (circles) from a (0, 1) Gaus-
sian distribution.



6.8 Controlled experiments and experimental design 


6.8.1 Replicability and ceteris paribus analysis 


Being able to control and reproduce variates is one of the major advantages of pseudorandom 
number generators: it allows the reproduction of previous experiments without having to store all 
generated variates. This is particularly valuable when previous results need to be reexamined, and 
effects due to sampling are to be investigated. Even more importantly, it allows for controlled 
experiments. Typical applications include cases where only a restricted number of variates are to 
be resampled for ceteris paribus experiments. Knowing the type of random number generator and 
the seeds used is enough to replicate the same variates and modify or substitute others. 

Likewise, it facilitates comparisons and evaluations for variations in the data-generating
model. Figs. 8.5 and 8.7 will compare how different parametrizations of time-series models produce
signatures; reusing the same underlying random numbers helps isolate the effects. One of the ad-
vantages of the inversion method is that it allows direct comparison of simulations with slightly
altered distribution properties. Consider the following example: Financial returns are often charac-
terized by heavy tails; a simple means to achieve this is to use the Student t distribution instead
of a normal distribution. To analyze how increasing kurtosis affects the outcome, ceteris paribus,
one can generate a uniform sample, $\{u_i\}_{i=1}^{n}$, and then convert it into the corresponding normal or
Student t value, $t^{-1}(\{u_i\}, \nu)$, for different degrees of freedom, $\nu$.

Similarly, it can be used to switch between distributions. Given, say, a standard normal sample
$\{z_i\}_{i=1}^{n}$, the corresponding samples for a t distribution can be obtained by first converting the nor-
mal variates into uniform ones via the CDF, $\{u_i\} = N(\{z_i\})$, which then are plugged into the inverse
of the target CDF. More generally, to convert a sample $x_i$ from distribution $F_X$ to distribution $F_Y$,
all one needs to do is $F_Y^{-1}(F_X(x_i))$. This helps to separate noise from effects due to changes in the
model specification.
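
The conversion from one distribution to another via the two CDFs can be sketched with base MATLAB functions only. To keep the example toolbox-free, the target here is an exponential distribution (whose inverse CDF has a closed form) rather than the Student t discussed above; this substitution, the rate lambda, and the fixed sample z are assumptions made for illustration. With the Statistics Toolbox, the t case would use tinv(normcdf(z),nu) instead.

```matlab
z = [-1.5 -0.5 0 0.5 1.5]';        % standard normal sample (fixed here for illustration)

u = 0.5 * erfc(-z / sqrt(2));      % F_X: standard normal CDF via erfc
lambda = 1;                        % assumed rate of the target exponential distribution
x = -log(1 - u) / lambda;          % F_Y^{-1}: inverse CDF of the exponential
```

The mapping is monotone, so ranks are preserved: the median of the normal sample (z = 0) is mapped onto the median of the exponential distribution, log(2)/lambda.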

At the same time, controlling the experiments too heavily is not good either. Similar to variance 
reduction techniques, the “natural” variation in samples is reduced, and not all derived statistics will 
benefit equally. Also, seeding can undermine the quality of the random number generator itself, as 
the following sections illustrate for some of MATLAB’s random number generators. 


6.8.2 Available random number generators in MATLAB 


MATLAB offers different types of RNGs for uniformly and normally distributed numbers.¹³ They
can be invoked together with setting the seed to a certain value s in the following (meanwhile 
outdated, yet still working) fashion: 


rand('seed',s) for a multiplicative congruential algorithm (i.e., a linear congruential gener-
ator with parameter c = 0, see footnote 5; default in version 4);

rand('state',s) for Marsaglia’s subtract with borrow algorithm (default in versions 5
through 7.3);

rand('twister',s) for the Mersenne Twister (default since version 7.4; widely regarded as
the best).


Setting the seed for the uniform RNG also resets the RNG for other distributions, including the 
ones provided with the statistics toolbox. Standard normal pseudorandom variables have their own 
seeds, and the user can choose between different types of RNGs: 


randn('seed',s) for the polar RNG (default in version 4);
randn('state',s) for the ziggurat RNG (default since version 5).


Since version 2008b, MATLAB also offers streaming: different variables can have individual 
streams of pseudorandom numbers (PRNs) with individual seeds and methods. However, with pre- 
vious methods being well established and requiring less CPU time (in particular, when one requires 
to control many streams simultaneously), not all users will see an immediate need to switch to 
this new approach. Also, previous methods have widely been used and provided the basis for a 
substantial body of literature in many different fields. This section, therefore, only addresses the 
“traditional” versions. Unfortunately, the functions rand and randn are opaque. Hence, no de- 
tailed comments can be made about how a seed is translated into the first PRN—nor can the user 
change it directly and make it suitable for his or her own requirements. 

Meanwhile, MathWorks discourages these syntaxes. Instead, a function rng has been introduced
to control the random number generation, regardless of the distribution.

rng(s,'twister') sets the seed when the RNG is based on the twister. Other generators
are available; the help file has more details.

rng('shuffle') picks a seed based on the current time. This is particularly helpful when you
need independent simulations right after starting MATLAB¹⁴ or if you want to go back to
independent experiments after setting the seed.
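
A minimal sketch of how rng supports replicability, assuming a MATLAB version that provides rng; the seed value 42 and the variable names are arbitrary.

```matlab
s = 42;                                  % arbitrary seed

rng(s, 'twister');                       % set generator and seed
a  = rand(3,1);  z1 = randn(3,1);        % uniform and normal draws

rng(s, 'twister');                       % reset to the same state
b  = rand(3,1);  z2 = randn(3,1);        % identical draws, for both distributions

same = isequal(a, b) && isequal(z1, z2); % true: the experiment is replicated

rng('shuffle');                          % back to time-based seeding
c = rand(3,1);                           % draws that are not tied to s
```

Note that a single call to rng controls both rand and randn, in contrast to the separate seeds of the older syntaxes.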


To see why previous versions were not perfect, let’s have a look at some of the outdated, yet still 
working alternatives. 


6.8.3 Uniform random numbers from MATLAB’s rand function 


When performing a sequence of Monte Carlo experiments, it is often desirable that individual ex-
periments be replicable and independent of the previous ones. Therefore, a simple approach would
be to set the seed for the first experiment to some value $s_1$ and then just increase the seed by a fixed
amount d, $s_i = s_{i-1} + d$, over experiments i. All this is under the (not unreasonable) assumption
that different seeds always produce different and independent first draws.

13. All descriptions of MATLAB’s RNGs are based on the documentation and help files for their 2008a and 2010a ver-
sions, which also provide details about and references for the different RNG algorithms in use. Help files and technical
documentations are available online on www.mathworks.com.

14. By default, MATLAB always starts with the same initial seed. To see the effect, start MATLAB, call rand or any other
RNG function, close and restart and call the same RNG function again.

FIGURE 6.8 First uniform PRN drawn for different seeds and RNGs.

FIGURE 6.9 Scatterplots for first uniform PRNs drawn from seeds s and s + 1.

Fig. 6.8 depicts $x_{1,s}$, the first PRNs drawn after the seed has been set to s = 1, ..., 1000. Ob-
viously, 'seed' produces patterns, and in 'state', clusters of samples with low values of $x_{1,s}$
appear to be generated only in certain regions of s: in the “southern” parts of the graph, clusters
and gaps seem to alternate in regular intervals. This, however, is not particularly pronounced.

Plotting the first random numbers from subsequent seeds, $(x_{1,s}, x_{1,s+1})$, yields no surprises
(Fig. 6.9): only the congruential RNG exhibits a linear relationship, whereas for the other two, no
obvious pattern emerges. Seeding seems to be an issue mainly for the congruential RNG, which has
long been supplemented with better alternatives. This also applies to RNGs for other distributions
that use uniform PRNs, for example, via the inversion method. This could be a problem only if a
user is unaware that the command rand('seed',s) not only sets the seed but also switches to
this unfavorable RNG.


6.8.4 Gaussian random numbers from MATLAB’s randn function 


For both RNGs for normal variates, things look completely different. For either method, one can
easily spot patterns and dependencies between seed and first draw. Fig. 6.10 contains a scatterplot
for seeds s = 1, ..., 10,000 and the first draw, $x_{1,s}$, for the polar RNG. This algorithm, invoked
with randn('seed',s), was the default in MATLAB version 4. It exhibits patterns where, for
example, every 1200 seeds or so (at least) one value above 3 and one below −3 is produced.

From MATLAB version 5 onwards, Marsaglia and Tsang’s ziggurat algorithm is the default
RNG for standard normal variates; the seed can be set by randn('state',s). For this method,
there are also clear patterns. Most obviously, there are long stretches of numbers with equal sign, as
can be seen from Fig. 6.11. All seeds in the range s = 1, ..., 447 produce $x_{1,s} > 0$. For the following
few seeds, the numbers are closely distributed around 0, and for s = 464, ..., 8383, only negative
$x_{1,s}$ are generated; the same is true for s = 8448, ..., 8704. For seeds s = 8704, ..., 16,383, on
the other hand, not a single one of the corresponding 7640 $x_{1,s}$ is negative. Within these long stretches
of numbers with equal sign, the PRNs appear to fall within triangular shapes, implying that, for
example, $x_{1,s} > 3$ (or $x_{1,s} < -3$, for that matter) tends to occur in small regions. For higher seeds,
these patterns sometimes get slightly more ragged; the main issues, however, persist.

FIGURE 6.10 First normal PRN (polar) drawn for seeds s = 1, ..., 10,000.

FIGURE 6.11 First normal PRN (ziggurat) drawn for seeds s = 1, ..., 600,000.

FIGURE 6.12 Scatterplot first draws from three subsequent seeds s = 1, ..., 5000.

In either case, this has implications on first draws when using subsequent seeds (Fig. 6.12). 
Increasing the intervals when picking seeds does not remedy the problem. Fig. 6.13 provides ex- 
emplary scatterplots for the first draws in which seeds are picked with step sizes of 100, 1000, and 
10,000, respectively. Other step sizes provide similar patterns. 

It must be emphasized again that this is not a problem of the respective RNG itself, but of how a 
provided seed s is translated into the first draw. This has consequences for controlled experiments: 
if in a series of (supposedly) independent runs or in subsequent iterations, the first step is to system-
atically set the seed for the normal RNG (or, with the argument 'seed', the uniform RNG), then
their first normal PRNs can exhibit patterns and dependencies. If the first PRN picked is a relevant 
state variable, this can cause bias problems. 
FIGURE 6.13 First PRN drawn for seeds s = 1, ..., 1000: $x_{1,s}$ plotted against $x_{1,s+d}$ with d = 100, 1000, and 10,000 (left
to right) for the polar (top panel) and ziggurat (bottom panel) RNGs.

FIGURE 6.14 101st PRN drawn for seeds s = 1, ..., 1000, $x_{101,s}$, for the uniform (top panel) and normal (bottom panel)
RNGs. Panel titles, top row: rand('seed',s); rand(100,1); | rand('state',s); rand(100,1); | rand('twister',s); rand(100,1);


6.8.5 Remedies 


The best way to avoid these systematic patterns is by abandoning these syntaxes and using the rng 
function instead. 

However, if you are working with older versions of MATLAB, here are some suggestions on how to
soften the problem. For later draws, these effects seem to vanish because the independence
properties of the RNG itself take over. Hence, a quick fix could be to draw n
blank PRNs immediately after resetting the seed. This allows the RNG to burn in, and subsequent
PRNs should be as well behaved as one can expect from this RNG. If n is prespecified and kept
constant, this also preserves replicability. For example, after choosing n = 100, Fig. 6.14 contains
the scatter plots for seeds and 101st draws for the three uniform and two normal RNGs. Comparing
them to the corresponding graphs in Figs. 6.8, 6.10, and 6.11 shows that the patterns have vanished,
the only exception being the “usual suspect,” that is, the congruential uniform RNG.
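
The mechanics of such a burn-in can be sketched in a few lines of R. This is only an illustration of the idea: R's set.seed already scrambles the seed, so the patterns described above are a feature of the legacy MATLAB syntaxes, and the names draw_after_burnin and n_burn are hypothetical.

```r
## draw one PRN after resetting the seed and discarding n_burn draws
## (n_burn is a hypothetical name for the n in the text)
draw_after_burnin <- function(seed, n_burn = 100) {
    set.seed(seed)
    if (n_burn > 0) invisible(runif(n_burn))  # blank draws: burn in
    runif(1)                                  # first draw that is kept
}

seeds <- 1:1000
x101 <- sapply(seeds, draw_after_burnin)      # 101st draw for each seed

## replicable as long as n_burn is kept fixed
identical(draw_after_burnin(10), draw_after_burnin(10))
```

Plotting x101 against seeds would give the R analogue of a panel in Fig. 6.14.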


Chapter 7 


Modeling dependencies 
7.1 Transformation methods
7.1.1 Linear correlation 


For many applications we need random variates that are dependent in a predetermined way. Our first
aim is to generate a matrix X of size N × p. For intuition, think of X as a sample of N observations
of the returns of p assets. The variates in a given column of X should follow specific distributions
(i.e., the marginal distributions of the specific asset), and the columns of X should be correlated.

The most-used measure of dependence is linear correlation. Assume two random variables Y
and Z. Then for a sample of N paired observations $(y_1, z_1), (y_2, z_2), \ldots, (y_N, z_N)$, linear correlation
$\rho$ is computed as

$$\rho_{Y,Z} = \frac{\frac{1}{N}\sum_{i=1}^{N} (y_i - m_Y)(z_i - m_Z)}{s_Y s_Z}. \qquad (7.1)$$

The variables m and s are the sample means and standard deviations, respectively. Linear correlation
is invariant to linear transformations: changing two random variables into $a_1 + b_1 Y$ and
$a_2 + b_2 Z$ will not change the linear correlation between them as long as $b_1$ and $b_2$ have the same
sign (if they are of opposite sign, the sign of $\rho$ will be reversed).
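
Eq. (7.1) and the invariance property can be checked numerically. The following R sketch uses a hypothetical helper lin_cor and assumes that the standard deviations in Eq. (7.1) are computed with divisor N; the data are arbitrary.

```r
## linear correlation as in Eq. (7.1); lin_cor is a hypothetical helper
lin_cor <- function(y, z) {
    N <- length(y)
    sy <- sqrt(mean((y - mean(y))^2))   # standard deviations with divisor N
    sz <- sqrt(mean((z - mean(z))^2))
    sum((y - mean(y)) * (z - mean(z))) / (N * sy * sz)
}

set.seed(42)
y <- rnorm(100)
z <- 0.5 * y + rnorm(100)

r1 <- lin_cor(y, z)
r2 <- lin_cor(3 + 2 * y, -1 + 0.1 * z)  # slopes with the same sign
r3 <- lin_cor(3 - 2 * y, -1 + 0.1 * z)  # slopes with opposite signs
c(r1, r2, r3, cor(y, z))
```

Here r1 and r2 coincide with the value returned by cor, while r3 has the opposite sign.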

Now we have not just two but p random variables. It is convenient to treat them as a random
vector Y of length p. To create X, we need to draw such a vector Y N times. Suppose Y should be
distributed as

$$Y \sim N(\mu, \Sigma).$$

Here $\mu$ is the vector of means with length p, and $\Sigma$ is the p × p variance-covariance matrix.
We start with a vector Y of i.i.d. standard Gaussian variates, so $\mu$ is a vector of zeros and $\Sigma$ is
the identity matrix of size p. The MATLAB® command randn creates a whole matrix, that is,
N draws of Y, in one step.


% create uncorrelated Gaussian variates
p = 5;   % number of assets
N = 500; % number of obs
X = randn(N, p);
% check
plotmatrix(X); corrcoef(X)

R’s function rnorm always returns a vector. We can write a function that acts like randn. 


Listing 7.1: C-ModelingDependencies/R/./Ch07/randn.R 


## randn.R -- version 2010-12-10
randn <- function(m, n)
    array(rnorm(m*n), dim = c(m, n))

We scale the columns of X to have exactly zero mean and unit variance. This is not necessary,
but most of the time it is harmless and convenient:¹

(i) If we transform a scalar Gaussian random variable Y with mean $\mu$ and variance $\sigma^2$ into
a + bY, its mean will be $a + b\mu$, and its variance will be $b^2\sigma^2$. Thus we can later on always
enforce the desired means and variances.

(ii) Linear correlation (in which we are interested here) is invariant to such linear transformations.
So we can first make the columns of X be correlated as desired, and then later change the
means and variances.

(iii) The rescaling simplifies computations: the correlation matrix is now equal to the
variance-covariance matrix and can be computed as $\frac{1}{N}X'X$.
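
Point (iii) is easy to verify. The R sketch below is a minimal check under the assumption that "unit variance" refers to the variance with divisor N; with the divisor N − 1 (as used by R's sd and MATLAB's std), the cross-product has to be divided by N − 1 instead.

```r
set.seed(1)
N <- 500; p <- 3
X <- array(rnorm(N * p), dim = c(N, p))

X <- sweep(X, 2, colMeans(X))                # exactly zero mean
X <- sweep(X, 2, sqrt(colMeans(X^2)), "/")   # exactly unit variance (divisor N)

S <- crossprod(X) / N                        # (1/N) X'X
max(abs(S - cor(X)))                         # numerically zero
```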


To generate correlated variates, we need two results. First, every variance-covariance matrix $\Sigma$ is
symmetric and real-valued. For now assume that the matrix is also positive-definite. If the matrix
were only semidefinite, it would not have full rank; this case is discussed below. In some pathological
cases the matrix can also be indefinite; see page 368. Such a symmetric, real, and positive-definite
matrix can always be decomposed into

$$\Sigma = LDL' = L\sqrt{D}\sqrt{D}L'$$

where L is a unit lower triangular matrix (i.e., it has ones on its main diagonal) and D is a diagonal
matrix with strictly positive elements. $\sqrt{D}$ means that we take the square root of each diagonal
element of D (which is always possible since all elements on the main diagonal of D are strictly
positive). The matrix $L\sqrt{D}$ is a lower triangular matrix; a convenient choice to compute it is the
Cholesky factorization.²

The second result is the following: suppose we generate a vector Y of uncorrelated Gaussian
variates, that is, $Y \sim N(0, I)$. Whenever we premultiply such a vector by a matrix B and add to the
product a vector A, the resulting vector is distributed as follows:

$$BY + A \sim N(A, BIB').$$

So, let $B = L\sqrt{D}$, then we get

$$BIB' = BB' = L\sqrt{D}\sqrt{D}L' = \Sigma.$$
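
The two results can be illustrated with a small R sketch. The matrix Sigma, the vector A, and the sample size are arbitrary choices made for this illustration; R's chol returns the upper triangular factor, so it is transposed to obtain the lower triangular B.

```r
Sigma <- matrix(c(1.0, 0.7,
                  0.7, 1.0), nrow = 2, byrow = TRUE)
B <- t(chol(Sigma))       # lower triangular, B %*% t(B) equals Sigma
A <- c(0.5, -1)

set.seed(123)
n <- 1e5
Y <- array(rnorm(2 * n), dim = c(2, n))  # each column is one draw of Y ~ N(0, I)
W <- B %*% Y + A                         # A is recycled, i.e., added to every column

rowMeans(W)               # close to A
cov(t(W))                 # close to Sigma
```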


Thus, we obtain the desired result by premultiplying the (column) vector of uncorrelated random
variates by the Cholesky factor. We want to create not only one vector Y, but a whole matrix of
N observations, that is, each row in X is one realization of Y, so we postmultiply the whole matrix
by B' (i.e., the upper triangular matrix):

$$X_c = XB'.$$

The columns of $X_c$ are correlated as desired. Let us go through these steps with MATLAB (see the
script Gaussian2.m). We start with the matrix X.


Listing 7.2: C-ModelingDependencies/M/./Ch07/Gaussian2.m 


% Gaussian2.m -- version 2011-01-16
p = 5;   % number of assets
N = 500; % number of obs
X = randn(N,p);
X = X * (diag(1./std(X)));
X = X - ones(N,1)*mean(X);
% check
plotmatrix(X); corrcoef(X)


1. The variance of the variance across repeated samples will, of course, be zero.
2. The Cholesky factorization is explained in detail in Section 3.1.3.

Next we set up a correlation matrix. MATLAB and R store matrices columnwise, and elements
can also be addressed as in a stacked vector. In both MATLAB and R, the Cholesky factor can
be computed with the command chol; note that both MATLAB and R return upper triangular
matrices.

Listing 7.3: C-ModelingDependencies/M/./Ch07/Gaussian2.m


%% induce linear correlation
rho = 0.7; % correlation between any two assets

% set correlation matrix
M = ones(p,p) * rho;
M(1 : (p+1) : (p*p)) = 1;

% compute cholesky factor
C = chol(M);

% induce correlation, check
Xc = X * C;
plotmatrix(Xc); corrcoef(Xc)


We can check the results by comparing the scatter plots of the columns of X and Xc, and by
computing the correlation. The result of a call to MATLAB's plotmatrix with p = 3 and N =
200 is shown in Fig. 7.1. The script Gaussian2.R shows the computations in R.


FIGURE 7.1 Left: scatter plot of three uncorrelated Gaussian variates. Right: scatter plot of three Gaussian variates with
$\rho = 0.7$.
Listing 7.4: C-ModelingDependencies/R/./Ch07/Gaussian2.R


# Gaussian2.R -- version 2011-05-15
p <- 5      # number of assets
N <- 500    # number of obs
rho <- 0.5  # correlation between two assets

## create uncorrelated observations
X <- rnorm(N * p); dim(X) <- c(N, p)

## check (see ?pairs)
panel.hist <- function(x, ...) {
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(usr[1:2], 0, 1.5))
    h <- hist(x, plot = FALSE)
    breaks <- h$breaks; nB <- length(breaks)
    y <- h$counts; y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y, ...)
}
pairs(X, xlim = c(-5,5), ylim = c(-5,5), labels = NA,
      diag.panel = panel.hist, col = grey(0.4))
cor(X)

## set correlation matrix
M <- array(rho, dim = c(p, p)); diag(M) <- 1

## induce correlation, check
C <- chol(M); Xc <- X %*% C
pairs(Xc, xlim = c(-5, 5), ylim = c(-5, 5), labels = NA,
      diag.panel = panel.hist, col = grey(0.4))
cor(Xc)


What if $\Sigma$ does not have full rank? In MATLAB, we can check the rank of Xc with the command
rank. In R, we can use qr(Xc)$rank or the function rankMatrix from the Matrix package
(Bates and Maechler, 2018). Here we stay with the MATLAB example, so we type

rank(Xc)

and get

5

As a test, we replace the pth column of Xc with a linear combination of the other columns.


Listing 7.5: C-ModelingDependencies/M/./Ch07/Gaussian2.m

Xc(:,p) = Xc(:,1:(p-1))*rand(p-1,1);


Then, as expected, the call

rank(Xc)

will give us a

4

A correlation matrix is at its heart the cross-product of the data matrix X. The rank of X’X can at 
most be the column rank of X (mathematically it will be the same rank; numerically X’X could be 
of lower rank than X because of finite precision). Hence if X is rank deficient so is the correlation 
matrix. It is worth checking the scatter plots of the rank-deficient matrix Xc. It is not at all obvious 
that we have a redundant asset. 


Listing 7.6: C-ModelingDependencies/M/./Ch07/Gaussian2.m


% check
plotmatrix(Xc); corrcoef(Xc)
M = corrcoef(Xc);
rank(M)


The Cholesky factorization requires full rank:


Listing 7.7: C-ModelingDependencies/M/./Ch07/Gaussian2.m


chol(M)


will (most of the time) result in


??? Error using ==> chol
Matrix must be positive definite.


(Just most of the time: in some cases MATLAB may not give an error even though the matrix is not
full rank.)
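
The same experiment can be sketched in R. Variable names mirror the MATLAB script; the seed is an arbitrary choice, and since chol may or may not stop with an error (depending on rounding), the call is wrapped in tryCatch.

```r
set.seed(42)
N <- 500; p <- 5
X <- array(rnorm(N * p), dim = c(N, p))
M <- array(0.5, dim = c(p, p)); diag(M) <- 1
Xc <- X %*% chol(M)
rank_full <- qr(Xc)$rank                      # full column rank

## make asset p a linear combination of assets 1 to p-1
Xc[, p] <- Xc[, 1:(p - 1)] %*% runif(p - 1)
rank_deficient <- qr(Xc)$rank

M2 <- cor(Xc)
smallest_ev <- min(eigen(M2, symmetric = TRUE)$values)  # about zero

## either the Cholesky factor or the error message
res <- tryCatch(chol(M2), error = function(e) conditionMessage(e))
c(rank_full, rank_deficient)
```

The correlation matrix M2 itself looks unremarkable; only its smallest eigenvalue (numerically zero) reveals the redundant asset.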

The theoretically best but often impractical approach is to check why there is rank deficiency.
Sometimes, we can work with a reduced matrix. In our example, we know that the pth asset does
not really have its own “stochastic driver,” and hence we could compute its return as a combination
of the returns of assets 1 to p − 1 (we could save a random variate). If that is not possible, we can
instead think about the decomposition of $\Sigma$ that we used. We required that

$$BB' = \Sigma$$

and the Cholesky factor was a convenient choice for B. But there are decompositions that do not
require that $\Sigma$ have full rank. One such alternative is the eigenvalue decomposition:

$$\Sigma = V\Lambda V'. \qquad (7.2)$$

The p × p matrix V has in its columns the eigenvectors of $\Sigma$; $\Lambda$ is diagonal and has as elements the
eigenvalues of $\Sigma$. Since $\Sigma$ is symmetric, the columns of V will be orthonormal, hence $V'V = I$,
implying that $V' = V^{-1}$. (For a nonsymmetric matrix, we cannot just transpose V in Eq. (7.2).)
Since $\Sigma$ is nonnegative-definite, the eigenvalues cannot be smaller than zero. Hence, we can write
$\Lambda$ as $\sqrt{\Lambda}\sqrt{\Lambda}$ (with the root taken element-wise), and so get another symmetric decomposition. To
induce correlations, just set $B = V\sqrt{\Lambda}$.


Listing 7.8: C-ModelingDependencies/M/./Ch07/Gaussian2.m


%% eigen decomposition
[V,D] = eig(M);
C = real(V*sqrt(D));
C = C';
Xcc = X * C;
plotmatrix(Xcc); corrcoef(Xcc)
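
A minimal R sketch of the eigenvalue approach follows. The 3 × 3 matrix M is a made-up example of a rank-deficient correlation matrix (the third asset equals 0.6 times the first plus 0.8 times the second); the clipping with pmax is a safeguard against tiny negative eigenvalues caused by rounding.

```r
## rank-deficient correlation matrix: asset 3 is driven by assets 1 and 2
M <- matrix(c(1.0, 0.0, 0.6,
              0.0, 1.0, 0.8,
              0.6, 0.8, 1.0), nrow = 3, byrow = TRUE)

e <- eigen(M, symmetric = TRUE)
lambda <- pmax(e$values, 0)            # clip tiny negative rounding errors
B <- e$vectors %*% diag(sqrt(lambda))  # B = V sqrt(Lambda)
recon_err <- max(abs(B %*% t(B) - M))  # B B' reproduces M

set.seed(11)
N <- 5000
X <- array(rnorm(N * 3), dim = c(N, 3))
Xc <- X %*% t(B)                       # rows are realizations: postmultiply by B'
cor_err <- max(abs(cor(Xc) - M))
c(recon_err, cor_err)
```

A call to chol(M) would typically fail for this matrix, whereas the eigenvalue decomposition goes through.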


In fact, we can also use the SVD (see page 37). The SVD decomposes a rectangular matrix X into
USV'.

Recall that we have scaled X so that each column has exactly zero mean and unit standard
deviation. In this case, the V in the eigenvalue decomposition and the SVD are the same, up to
numerical precision, sorting, and sign; note that the MATLAB help suggests

[U,S,V] = svd(X)

for the SVD, and

[V,D] = eig(A)

for the eigenvalue decomposition; the V in both cases is no coincidence. We have:

$$\Sigma = \frac{1}{N}X'X = \frac{1}{N}VS'U'USV'.$$

The U and V matrices are orthonormal, that is, U'U = I and V'V = I. Hence we are left with

$$\frac{1}{N}X'X = \frac{1}{N}VS'SV'$$

and $\frac{1}{N}S'S = \Lambda$. That is, the squared singular values of X are the eigenvalues of X'X.


Listing 7.9: C-ModelingDependencies/M/./Ch07/Gaussian2.m


%% eigen v. svd
% eigen decomposition
M = corrcoef(X);
[V1,D] = eig(M);
C = real(V1*sqrt(D));
C = C';

% svd
[U,S,V2] = svd(X);

% ratio of sing values squared to eigenvalues
((diag(S).^2)/(N-1)) ./ sort(diag(D),'descend')
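
The comparison can be repeated in R. This sketch assumes standardization with R's scale, which (like MATLAB's std) uses the divisor N − 1; hence the squared singular values are divided by N − 1, as in the MATLAB listing.

```r
set.seed(7)
N <- 200; p <- 4
X <- scale(array(rnorm(N * p), dim = c(N, p)))  # zero mean, unit sd (divisor N-1)

ev <- eigen(cor(X), symmetric = TRUE)$values    # sorted in decreasing order
sv <- svd(X)$d                                  # also in decreasing order

## ratio of squared singular values to eigenvalues: a vector of ones
ratio <- sv^2 / (N - 1) / ev
ratio
```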


7.1.2 Rank correlation 


Linear correlation has a number of disadvantages: it may not capture certain nonlinear relationships, 
and it may make no sense at all for certain distributions. In fact, while it is true that correlation is 
bounded between —1 and +1, for many distributions these bounds are far tighter. Embrechts et 
al. (1999) give, as an example, the lognormal distribution. Lognormal variates can be obtained 
by creating Gaussian variates Z, and then transforming them with exp(Z). Here is a complete 
example: 


Listing 7.10: C-ModelingDependencies/M/./Ch07/lognormals.m 


% lognormals.m -- version 2010-01-08
%% uncorrelated Gaussian variates
p = 3; N = 200; X = randn(N,p);
X = X * (diag(1./std(X)));
X = X - ones(N,1)*mean(X);
figure(1)
plotmatrix(X); corrcoef(X)

%% induce linear correlation
rho = 0.5; M = ones(p,p) * rho;
M(1:(p + 1):(p * p)) = 1;
C = chol(M); Xc = X * C;
figure(2)
plotmatrix(Xc); corrcoef(Xc)

%% make exp
Z = exp(Xc);
figure(3)
plotmatrix(Z); corrcoef(Z)



When we compute the correlation of Xc:

corrcoef(Xc)

we get indeed:

ans =
    1.0000    0.5073    0.5093
    0.5073    1.0000    0.4763
    0.5093    0.4763    1.0000

But for the lognormals Z we get correlations like

ans =
    1.0000    0.3458    0.4098
    0.3458    1.0000    0.4131
    0.4098    0.4131    1.0000

So the correlations have shrunk.


Example 7.1


For lognormal variates, the attainable linear correlation is a function of the variances of the normals. Let
$Y_1$ and $Y_2$ follow a Gaussian distribution with standard deviations $\sigma_1$ and $\sigma_2$ and be linearly
correlated with $\rho$, then the linear correlation between the associated lognormals can be computed analytically:

$$\text{linear correlation}(e^{Y_1}, e^{Y_2}) = \frac{e^{\rho\sigma_1\sigma_2} - 1}{\sqrt{(e^{\sigma_1^2} - 1)(e^{\sigma_2^2} - 1)}}.$$

What is more, if $\log(Y_1)$ has a Gaussian distribution with zero mean and unit variance, and $\log(Y_2)$ has
a Gaussian distribution with zero mean and variance $\sigma^2$, then (McNeil et al., 2005, Chapter 5)

$$\rho_{\min} = \frac{e^{-\sigma} - 1}{\sqrt{(e - 1)(e^{\sigma^2} - 1)}} \quad\text{and}\quad \rho_{\max} = \frac{e^{\sigma} - 1}{\sqrt{(e - 1)(e^{\sigma^2} - 1)}}.$$

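
The formulas of Example 7.1 are easily evaluated. In the following R sketch, lognormal_cor is a hypothetical helper name; the inputs are the correlation and the standard deviations of the underlying Gaussians.

```r
## linear correlation of exp(Y1) and exp(Y2) for Gaussian Y1, Y2
## (hypothetical helper implementing the formula of Example 7.1)
lognormal_cor <- function(rho, sigma1, sigma2)
    (exp(rho * sigma1 * sigma2) - 1) /
        sqrt((exp(sigma1^2) - 1) * (exp(sigma2^2) - 1))

r_unit <- lognormal_cor(0.5, 1, 1)   # Gaussians with rho = 0.5, unit variances
r_sd5  <- lognormal_cor(0.5, 5, 1)   # first standard deviation raised to 5

## attainable range when sigma1 = 1 and sigma2 = sigma
sigma <- c(1, 2, 3, 5)
rho_min <- lognormal_cor(-1, 1, sigma)
rho_max <- lognormal_cor( 1, 1, sigma)
cbind(sigma, rho_min, rho_max)
```

With unit variances, a Gaussian correlation of 0.5 translates into a lognormal correlation of (e^0.5 − 1)/(e − 1), roughly 0.38; the attainable range narrows quickly as sigma grows.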
As a test, we increase the standard deviation of the first column of Xc to 5: 


Listing 7.11: C-ModelingDependencies/M/./Ch07/lognormals.m 


%% change variance of Xc 


sd = 5; 

Xc(:,1) = sd*Xc(:,1); 
Z = exp(Xc); 

figure (4) 


plotmatrix(Z); corrcoef (Z) 


We get a correlation matrix like the following: 


ans = 
1.0000 0.0130 0.0551 
0.0130 1.0000 0.4131 
0.0551 0.4131 1.0000 


Thus, for certain distributions, linear correlation is not an appropriate choice to measure
comovement. There are alternatives to linear correlation: we can use rank correlation.


The best-known rank correlation coefficient is that of Spearman. To compute Spearman
correlation $\rho^S$ between Y and Z, we replace the observations $y_i$ and $z_i$ by their ranks; then we can use
Eq. (7.1). Spearman correlation is sometimes also defined as the linear correlation between $F_Y(Y)$
and $F_Z(Z)$ where $F_{(\cdot)}$ are the distribution functions of the random variables. This maps the
realizations into (0, 1); it is equivalent to the ranking approach in the population but not in the sample. To
compute ranks with R, we can use the function rank.


Listing 7.12: C-ModelingDependencies/R/./Ch07/Spearman.R 


Y <- rnorm(20) 
Z <- rnorm(20) 
cor (Y, Z, method = "spearman") 


ranksY <- rank(Y) 
ranksZ <- rank(Z) 
cor(ranksY, ranksZ, method = "pearson") 


Ranking the elements of a vector with MATLAB is not so straightforward. Here is a small
example. We have a vector Y, and we want to obtain the ranks, given in the column “ranks of Y.”


Y sorted Y ranks of Y 
4.9 0.0 5 
3.2 1.5 4 
0.0 2.7 1 
2.7 3.2 3 
1.5 4.9 2 
The MATLAB function sort returns a sorted vector and (optionally) a vector of indices. These
indices are the sorting order for the original vector. Hence if

[sortedY,indexY] = sort(Y)

we have that sortedY is the same as Y(indexY). In the example, the index vector would be
[3 5 4 2 1]'. (As a side note, such indexes can be used to create permutations of vectors; see
page 118.) We want ranks, not indexes. So we need the indexes of the sorted indexes; see the
following MATLAB code.


Listing 7.13: C-ModelingDependencies/M/./Ch07/Spearman.m 


% spearman.m version -- 2011-01-10
Y = randn(20,1);
Z = randn(20,1);
corr(Y,Z,'type','Spearman')
%
[ignore, indexY] = sort(Y);
[ignore, indexZ] = sort(Z);
[ignore, ranksY] = sort(indexY);
[ignore, ranksZ] = sort(indexZ);
corr(ranksY,ranksZ,'type','Pearson')


This only works if the elements in Y are all distinct, that is, there are no ties. In MATLAB’s 
Statistics Toolbox, the function tiedrank computes average ranks for cases with ties. R’s rank 
also handles ties correctly. 
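
The “indexes of the sorted indexes” idea carries over to R, where order plays the role of the second output of MATLAB's sort. The sketch below uses the vector from the small table above and a second, made-up vector with ties.

```r
Y <- c(4.9, 3.2, 0.0, 2.7, 1.5)   # the example vector from the table
indexY <- order(Y)                # sorting order: 3 5 4 2 1
ranksY <- order(indexY)           # ranks:         5 4 1 3 2
all(ranksY == rank(Y))            # TRUE: no ties, so both approaches agree

## with ties the two approaches differ
Yt <- c(1, 2, 2, 3)
order(order(Yt))                  # 1 2 3 4 (ties broken by position)
rank(Yt)                          # 1.0 2.5 2.5 4.0 (average ranks)
```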

Spearman correlation has a more general invariance property than linear correlation. Since it 
only uses ranks, it does not change under monotonically increasing transformations. Since dis- 
tribution functions and their inverses have this property, the rank correlation stays the same. 
Try: 


X = randn(10000,2);
M = [1 0.7; 0.7 1];
C = chol(M); X = X*C;
corr(X,'type','Spearman')
corr(exp(X),'type','Spearman')      % a monotone transformation
corr(normcdf(X),'type','Spearman')  % a monotone transformation

But how can we induce rank correlation between variates with specified marginal distributions?
We need a sample of uniforms with a given rank correlation, then we can use the inversion method
(Section 6.3.1). What we know is how to generate a sample of Gaussians with a specified linear
correlation. It turns out this is all we need, since in the Gaussian case there exist explicit relations
between rank and linear correlation (Hotelling and Pabst, 1936, McNeil et al., 2005):

$$\rho^{\mathrm{linear}} = 2\sin\left(\tfrac{\pi}{6}\rho^{S}\right), \qquad \rho^{S} = \tfrac{6}{\pi}\arcsin\left(\tfrac{\rho^{\mathrm{linear}}}{2}\right). \qquad (7.3)$$

Algorithm 22 describes a procedure to create a random vector Y with marginal distribution F
and rank correlation matrix $\Sigma^{\mathrm{rank}}$. An example, creating lognormals with a rank correlation of 0.9,
follows.


3. An alternative rank correlation measure is Kendall's $\tau$, defined as follows:

$$\tau = \text{prob}\big((Y_i - Y_j)(Z_i - Z_j) > 0\big) - \text{prob}\big((Y_i - Y_j)(Z_i - Z_j) < 0\big).$$

prob(·) stands for “probability that (·).” For the Gaussian case, we also have an explicit relation between linear correlation
and Kendall's $\tau$:

$$\rho^{\mathrm{linear}} = \sin\left(\tfrac{\pi}{2}\tau\right), \qquad \tau = \tfrac{2}{\pi}\arcsin\rho^{\mathrm{linear}}.$$
Algorithm 22 Generate variates with specific rank correlation.
1: set $\Sigma^{\mathrm{rank}}$ (rank correlation)
2: compute corresponding $\Sigma$ (linear correlation) with Eq. (7.3)
3: generate Gaussian variates Z with linear correlation $\Sigma$
4: compute uniforms $U = F_{\mathrm{Gaussian}}(Z)$ (which preserves rank correlation)
5: compute $Y = F^{-1}(U)$ (which preserves rank correlation)

Listing 7.14: C-ModelingDependencies/M/./Ch07/exRankcor.m


% exRankcor.m -- version 2010-01-09
%% example rank correlation
p = 3; N = 1000;
X = randn(N,p);
X = X * (diag(1./std(X)));
X = X - ones(N,1)*mean(X);
plotmatrix(X); corrcoef(X)

%% induce rank correlation: Spearman
rhoS = 0.9; % correlation between any two assets

% set rank correlation matrix
Mrank = ones(p,p) * rhoS;
Mrank(1:(p + 1):(p * p)) = 1;

% compute corresponding linear correlation matrix
M = 2*sin(pi/6.*Mrank);

% compute cholesky factor
C = chol(M);

% induce correlation, check
Xc = X * C;
plotmatrix(Xc);

% check
corr(Xc,'type','Pearson')
corr(Xc,'type','Spearman')

sd = 5; Xc(:,1) = sd * Xc(:,1);
Z = exp(Xc);
% check
corr(Z,'type','Pearson')
corr(Z,'type','Spearman')

The interesting bit happens in lines 30-34 of the script (the final block, from sd = 5 onward).
exp() is a monotone transformation, so the rank correlation remains. In fact, the inverse of the
lognormal distribution function is $\exp(F^{-1}_{\mathrm{Gaussian}})$. The linear correlation of
the lognormals is reduced as before:


ans = 
1.0000 0.4516 0.4849 
0.4516 1.0000 0.8532 
0.4849 0.8532 1.0000 


But the rank correlation stays where it is. 


ans = 
1.0000 0.9037 0.9031 
0.9037 1.0000 0.9047 
0.9031 0.9047 1.0000 


In fact, for Spearman correlation we would not really have needed the adjustment in Eq. (7.3) since
the maximum difference between $\rho$ and $\rho^S$ in the Gaussian case is less than 0.02. So when we
compare the MATLAB scripts lognormals.m and exRankcor.m, we have done nothing much
different compared with the Gaussian case; if you look at the scatter plots, you find that they may
still look awkward because of the right tails of the lognormal. The only thing that is different now
is how we measure correlation; the actual results are almost the same.
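
The size of the adjustment in Eq. (7.3) can be checked on a grid. In this R sketch the function names spearman2linear and linear2spearman are hypothetical; the relations hold for the Gaussian case only.

```r
## Eq. (7.3), Gaussian case only (hypothetical helper names)
spearman2linear <- function(rhoS) 2 * sin(pi / 6 * rhoS)
linear2spearman <- function(rho)  6 / pi * asin(rho / 2)

rho <- seq(-1, 1, by = 0.001)
d <- abs(rho - linear2spearman(rho))
max_diff <- max(d)             # largest gap between linear and Spearman correlation
rho_at_max <- rho[which.max(d)]
c(max_diff, rho_at_max)
```

The largest gap is about 0.018 and occurs at a linear correlation of roughly ±0.59.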


As another example, we create rank-correlated triangular variates T. Such variates are often
used in decision modeling since they only require the modeler to specify a range of possible
outcomes (Min to Max) and the most likely outcome Mode. Triangular variates T can be simulated in
a number of ways (Devroye, 1986). One possibility is

$$T = \text{Mode} + \big(\text{Min} + U_1(\text{Max} - \text{Min}) - \text{Mode}\big)\sqrt{U_2},$$

or we could use

$$T = \text{Mode} + \big(\text{Min} + U_1(\text{Max} - \text{Min}) - \text{Mode}\big)\max(U_2, U_3);$$

see Devroye (1996). The $U_i$ are uniform variates. The R script tria.R implements both variants.

Listing 7.15: C-ModelingDependencies/R//Ch07/tria.R


## tria.R -- version 2010-12-16
## variant 1
Min <- 0; Max <- 3; Mode <- 0.75
rtria1 <- function(u1, u2, Min, Max, Mode) {
    Mode + (Min + u1 * (Max - Min) - Mode) * sqrt(u2)
}
trials <- 1000
T <- rtria1(runif(trials), runif(trials), Min, Max, Mode)
hist(T)

## variant 2
Min <- 0; Max <- 3; Mode <- 0.75
rtria2 <- function(u1, u2, u3, Min, Max, Mode) {
    Mode + (Min + u1 * (Max - Min) - Mode) * pmax(u2, u3)
}
trials <- 1000
T <- rtria2(runif(trials), runif(trials), runif(trials),
            Min, Max, Mode)
hist(T, breaks = 100)


We can also use the inverse of the triangular distribution. 


Listing 7.16: C-ModelingDependencies/R/./Ch07/tria.R 


## inverse
triaInverse <- function(u, min, max, mode) {
    range <- max - min
    ifelse(u <= (mode - min)/range,
           min + sqrt(u * range * (mode - min)),
           max - sqrt((1 - u) * range * (max - mode)))
}

trials <- 1000
u <- runif(trials)
Min <- 0; Max <- 3; Mode <- 0.75
system.time(T <- triaInverse(u, Min, Max, Mode))
hist(T)
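
A quick way to gain confidence in the inverse is a round trip through the distribution function. The R sketch below repeats triaInverse so that it is self-contained; triaCDF is a hypothetical helper that implements the standard distribution function of the triangular distribution.

```r
triaInverse <- function(u, min, max, mode) {
    range <- max - min
    ifelse(u <= (mode - min)/range,
           min + sqrt(u * range * (mode - min)),
           max - sqrt((1 - u) * range * (max - mode)))
}
## distribution function of the triangular distribution (hypothetical helper)
triaCDF <- function(x, min, max, mode) {
    range <- max - min
    ifelse(x <= mode,
           (x - min)^2 / (range * (mode - min)),
           1 - (max - x)^2 / (range * (max - mode)))
}

Min <- 0; Max <- 3; Mode <- 0.75
u <- seq(0.01, 0.99, by = 0.01)
x <- triaInverse(u, Min, Max, Mode)
err <- max(abs(triaCDF(x, Min, Max, Mode) - u))   # round-trip error

set.seed(3)
m_hat <- mean(triaInverse(runif(1e5), Min, Max, Mode))
c(err, m_hat, (Min + Max + Mode) / 3)             # sample v. theoretical mean
```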


All variants could be improved. The command pmax(x, y), for instance, could be replaced by

$$\frac{x + y}{2} + \left|\frac{x - y}{2}\right|$$

which is often faster. Likewise, the result of ifelse can often be obtained faster by directly
evaluating the logical expression. Such ideas, of course, provide speed at the cost of obscuring the
code. The next program creates triangular variates with a Spearman rank correlation of 0.7.
Listing 7.17: C-ModelingDependencies/R/./Ch07/tria.R


## correlation
N <- 1000; p <- 4
rho <- 0.7; rho <- 2 * sin(rho * pi / 6)  ## spearman
X <- array(rnorm(N * p), dim = c(N, p))
C <- matrix(rho, nrow = p, ncol = p)
diag(C) <- 1; C <- chol(C)
X <- X %*% C
cor(X, method = "spearman")

U <- pnorm(X)
T <- triaInverse(U, Min, Max, Mode)

## graphic (see ?pairs)
panel.hist <- function(x) {
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(usr[1:2], 0, 1.5))
    h <- hist(x, plot = FALSE)
    breaks <- h$breaks; nB <- length(breaks)
    y <- h$counts; y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y, col = grey(.5))
}
par(las = 1, mar = c(2,2,0.5,0.5), ps = 10, tck = 0.01,
    mgp = c(3, 0.2, 0), pch = ".")
pairs(T, diag.panel = panel.hist, gap = 0)


In case we ever need it, we could also create uniforms with a given linear correlation as specified
in a matrix $\Sigma$. In the population (or in large samples),

$$\rho^{S}_{Y,Z} = \rho^{\mathrm{linear}}(F_Y(Y), F_Z(Z)), \qquad (7.4)$$

for two random variables Y and Z. That is, the linear correlation between the uniforms obtained
from transforming the original variates equals the Spearman correlation between the original
variates. We set up the desired linear correlation matrix $\Sigma$; next we need to generate Gaussian Y and Z
with Spearman correlation $\Sigma^{\mathrm{rank}}$ equal to this $\Sigma$. By Eq. (7.4), this will be the linear correlation for the uniforms.


The following MATLAB script creates 1000 realizations of four correlated random variates,
where the first two variates have a Gaussian distribution and the other two are uniformly distributed.


Listing 7.18: C-ModelingDependencies/M/./Ch07/exUniforms.m


% exUniforms.m -- version 2011-01-07
% generate normals, check correlations
X = randn(1000,4);
corrcoef(X)

% desired linear correlation
M = [1.0 0.7 0.6 0.6;
     0.7 1.0 0.6 0.6;
     0.6 0.6 1.0 0.8;
     0.6 0.6 0.8 1.0];

% adjust correlations for uniforms
M = 2 * sin(pi/6 .* M);

% induce correlation, check correlations
C = chol(M);
Xc = X * C;
corrcoef(Xc)

% create uniforms, check correlations
Xc(:,3:4) = normcdf(Xc(:,3:4));
corrcoef(Xc)

% plot results (marginals)
for i=1:4
    subplot(2,2,i);
    hist(Xc(:,i))
    title(['X ', int2str(i)])
end


As a final example, assume we have samples of returns of two assets, collected in vectors Y1 and
Y2, but assume they are not synchronous; they could even be of different length. We can still induce
rank correlation between these empirical distributions and sample from them.


Listing 7.19: C-ModelingDependencies/M/./Ch07/exEmpCor.m 


% exEmpCor.m -- version 2011-01-04
NO = 200; % empirical number of obs
Y1 = randn(NO,1);
Y2 = randn(NO,1); Y2(Y2<0.3) = -0.5; % make Y2 non-Gaussian

% create CDFs
sortedY1 = sort(Y1);
sortedY2 = sort(Y2);

% resample
N = 1000; % (re)sample size
Z = randn(N,2); rho = 0.9;
M = [1 rho; rho 1]; Z = Z*chol(M);
U = normcdf(Z); U = ceil(NO*U);

% check
corrcoef(Y1,Y2)
corrcoef(sortedY1(U(:,1)),sortedY2(U(:,2)))

% histograms and scatter of original Y1 and Y2
subplot(231), hist(Y1)
subplot(232), hist(Y2)
subplot(233), scatter(Y1(U(:,1)),Y2(U(:,2)))

% histograms and scatter of resampled/correlated Y1 and Y2
subplot(234), hist(Y1(U(:,1)))
subplot(235), hist(Y2(U(:,2)))
subplot(236), scatter(sortedY1(U(:,1)),sortedY2(U(:,2)))


The plots (not displayed in the book) show that the marginal distributions stay the same, but the 
joint distribution now shows strong comovement. In the following sections we will discuss methods 
that give us more control over the joint distribution of random variables. 
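
The resampling idea can also be sketched in R, here with two samples of different length. The data (a Gaussian and an exponential sample), the sample sizes, and the seed are arbitrary choices for this illustration; pmax guards against an index of zero, which would occur if a uniform were exactly zero.

```r
set.seed(99)
Y1 <- rnorm(200)                 # "empirical" sample of asset 1
Y2 <- rexp(150)                  # asset 2: different length, non-Gaussian

## Gaussian variates with linear correlation rho, mapped to uniforms
N <- 1000; rho <- 0.9
M <- matrix(c(1, rho, rho, 1), nrow = 2)
Z <- array(rnorm(N * 2), dim = c(N, 2)) %*% chol(M)
U <- pnorm(Z)

## invert the empirical distribution functions: pick order statistics
S1 <- sort(Y1)[pmax(1, ceiling(length(Y1) * U[, 1]))]
S2 <- sort(Y2)[pmax(1, ceiling(length(Y2) * U[, 2]))]

rs <- cor(S1, S2, method = "spearman")
rs
```

Every resampled value is one of the original observations, so the marginal distributions are preserved, while the Spearman correlation of the resampled pairs is close to the value implied by rho.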


7.2 Markov chains 
7.2.1 Concepts 


In a Markov chain, the probability for the next state or value of a variate depends on the current
state. This approach is particularly useful for time series with conditional probabilities, and Markov
Chain Monte Carlo (MCMC) plays a very prominent role in financial simulations.

In its simplest version, there is only a limited set of alternatives, and the probabilities for the
next state of a variable only depend on its current state. For example, consider a bond that can be
rated either “investment grade” (IG, state 1) or “high yield” (HY, state 2). With a low probability, the
rating will change, but most likely, it will remain the same. In other words, the rating in the next
period depends strongly on the current rating, because upgrading or downgrading usually happens
only with a low probability. For example, $\text{prob}(x_{i+1} = IG \mid x_i = IG) > \text{prob}(x_{i+1} = IG \mid x_i = HY)$.
The conditional probabilities can be collected in a transition matrix $\Pi$:

$$\Pi = \left[p_{x_{i+1}|x_i}\right] = \begin{bmatrix} p_{IG|IG} & p_{HY|IG} \\ p_{IG|HY} & p_{HY|HY} \end{bmatrix}.$$

Given the transition matrix, sampling is rather straightforward. The rows in the transition matrix
provide the conditional probabilities for a given current state $x_i$; for example, $\Pi = \begin{bmatrix} 0.9 & 0.1 \\ 0.1 & 0.9 \end{bmatrix}$ means
that, for either state, there is a 90% chance that the next state will be the same and a 10% chance
that the process will switch. Sampling the subsequent state $x_{i+1}$ can then be done with a roulette
wheel principle (see Section 6.5.2).

Listing 7.20: C-ModelingDependencies/M/./Ch07/DiscreteMC.m 


function x = DiscreteMC(prob,N,x_0)
% DiscreteMC.m -- version 2011-01-06
% discrete Markov Chain with N samples
% prob: transition variables
if nargin < 3
    uc_prob = prob^50;
    x_0 = RouletteWheel(uc_prob(1,:),1);
end
x = zeros(N+1,1);
x(1) = x_0;
for i = 1:N
    x(i+1) = RouletteWheel(prob(x(i),:),1);
end
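
A comparable sampler is short in R as well. In the sketch below, discrete_mc is a hypothetical function name, and base R's sample.int with the argument prob takes the role of the roulette wheel; the transition matrix is the symmetric example from the text.

```r
## discrete Markov chain with N transitions, started in state x0
## (hypothetical function; sample.int acts as the roulette wheel)
discrete_mc <- function(P, N, x0) {
    stopifnot(nrow(P) == ncol(P), all(abs(rowSums(P) - 1) < 1e-12))
    x <- integer(N + 1)
    x[1] <- x0
    for (i in seq_len(N))
        x[i + 1] <- sample.int(nrow(P), size = 1, prob = P[x[i], ])
    x
}

set.seed(2011)
P <- matrix(c(0.9, 0.1,
              0.1, 0.9), nrow = 2, byrow = TRUE)
x <- discrete_mc(P, 10000, x0 = 1)
stay <- mean(x[-1] == x[-length(x)])   # share of periods without a switch
stay
```

The share of periods in which the state does not change should be close to 0.9.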


If there are $n_x$ different states and $p_{x_{i+1}|x_i} = 1/n_x$, there will be no serial dependence. This changes,
however, if the probabilities are state dependent. For a transition matrix such as $\Pi = \begin{bmatrix} p & 1-p \\ 0 & 1 \end{bmatrix}$ with p close to one, state 1 is likely to be followed
by state 1; once state 2 has been sampled, the process will never return to state 1. In this case, state 2
is called an “absorbing state.” Alternatively, a transition matrix with small diagonal and large off-diagonal
elements will produce heavy oscillation between states, though, again, there is a higher chance that the
system is in state 2 if $p_{2|1} > p_{1|2}$. To
get the conditional probabilities for the state for the period after the next, the transition matrix must
be multiplied by itself, $p_{x_{i+2}|x_i} = \Pi^2$; for the $\tau$th future period, $p_{x_{i+\tau}|x_i} = \Pi^\tau$. Note that the larger $\tau$
becomes, the closer these probabilities get to the unconditional probabilities for the different states.
Note that $\Pi$ must be a square $n_x \times n_x$ matrix, and the rows must add up to 1.
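
The convergence of the multi-period transition probabilities can be seen in a small R sketch. The transition matrix is a made-up example and mat_power a hypothetical helper; for this matrix, the unconditional probabilities are 0.75 and 0.25.

```r
## k-fold matrix product P %*% P %*% ... %*% P (hypothetical helper)
mat_power <- function(P, k) {
    out <- diag(nrow(P))
    for (i in seq_len(k)) out <- out %*% P
    out
}

P <- matrix(c(0.9, 0.1,
              0.3, 0.7), nrow = 2, byrow = TRUE)
P2  <- mat_power(P, 2)    # transition probabilities two periods ahead
P50 <- mat_power(P, 50)   # both rows close to the unconditional probabilities
P2
P50
```

Both rows of P50 are practically identical: after 50 periods the starting state no longer matters. This is also why the MATLAB function DiscreteMC uses prob^50 to draw the initial state.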

This principle can be generalized for joint densities $p_{x,y}$ with conditional distributions $p_{x|y}$ and
$p_{y|x}$. The simulation procedure is just a variant of the univariate case. An important prerequisite
here is that at least one of the variates depends on the lagged value. If, say, $x_{i+1}$ depends on $y_i$, but
$y_{i+1}$ is conditioned on $x_{i+1}$, one has to bear in mind the sequence for drawing the samples. Also,
if x and y can assume a different number of states, $n_x$ and $n_y$, respectively, then the dimensions for
$\Pi_x$ and $\Pi_y$ must be $n_y \times n_x$ and $n_x \times n_y$, respectively.


Listing 7.21: C-ModelingDependencies/M/./Ch07/DiscreteMC2.m 


function x = DiscreteMC2(prob_1,prob_2,N,x_0)
% DiscreteMC2.m -- version 2011-01-06
% discrete Markov Chain with N samples
% prob: transition variables
if nargin < 4
    uc_prob_1 = ( prob_2 * prob_1 )^50;
    uc_prob_2 = ( prob_1 * prob_2 )^50;
    x_0(1) = RouletteWheel(uc_prob_1(1,:),1);
    x_0(2) = RouletteWheel(uc_prob_2(1,:),1);
end;
x = zeros(N+1,2);
x(1,:) = x_0;
for i = 1:N
    x(i+1,1) = RouletteWheel(prob_1(x(i,2),:),1);
    x(i+1,2) = RouletteWheel(prob_2(x(i,1),:),1);
end
x(1,:) = [];

Markov chains can also be generalized to continuous distributions;⁴ the geometric Brownian
motion, often used to model stock price processes, would be one example for this: the new price
depends on the (realized) previous price plus a (random) price change (see Section 8.3).


7.2.2 The Metropolis algorithm 


Metropolis et al. (1953) suggest a Monte Carlo method to simulate states for a Boltzmann
distribution in molecular and atomic systems. Assume the current state (or position) is $x_i$ and the
energy function is $E(x_i)$. Then, a random new position is generated within the neighborhood of $x_i$,
$y = x_i + u_i s$, where $u_i$ is a vector of uniform samples $u_i \in [-1, +1]$ (same dimensions as $x_i$), and s
is a scalar. This new position is accepted for certain if it has lower energy than $x_i$. But a higher
energy state is also accepted with a certain probability $\exp(-\Delta E/(kT))$, where $\Delta E = E(y) - E(x_i)$, T
is the temperature, and k is the Boltzmann constant. If kT is large (in proportion to $\Delta E$), chances
of acceptance, $x_{i+1} := y$, are high, otherwise they are low.⁵

In terms of the acceptance-rejection method, the idea was to generate a candidate new
solution, set the density $f(x_{\mathrm{new}})$ in proportion to the majorizing density, and use this proportion as an
acceptance probability. The twist here is that the reference point is the current point's density: the
new solution is accepted with probability $p = \min(1, f(x_{\mathrm{new}})/f(x_i))$. If the new point has a higher
density than the current one, it is accepted; if its density is lower in proportion to the current one,
the chance of acceptance is also lower. For a d-dimensional variate x, the pseudocode is given in
Algorithm 23.


Algorithm 23 Metropolis algorithm.
1: initialize $x_1$, $n$, and $s$
2: for $i = 1 : (n-1)$ do
3:     while $x_{i+1}$ not assigned do
4:         draw $z \in [0, 1]$ and $u_i \in [-1, 1]^d$
5:         $x_{\text{new}} = x_i + u_i \cdot s$
6:         if $f(x_{\text{new}})/f(x_i) \ge z$ then $x_{i+1} = x_{\text{new}}$
7:     end while
8: end for


This algorithm produces a vector of $n$ variates, and the stationary distribution of the $x_i$s will be (proportional to) $f$. The parameter $s$ governs the maximum step size in any direction. In principle, one is rather free in picking a value for $s$, though there are some general principles. When $s$ is small, the current and the suggested value will be very similar, and so will be their densities. Hence, the acceptance ratio will be close to 1, and it should not take many attempts to get a new solution. But because of the similarity in values, there will be strong serial dependence between the $x_i$s, and there can be clusters in some parts of the probability space while no samples from other relevant areas exist. Choosing a large $s$ has the exact opposite effect: there will be less serial dependence in the $x_i$s, but it will take more attempts to generate a new variate. Also, when the number of dimensions increases while $s$ is kept fixed, the size of the neighborhood grows, and the overall step size in terms of the Euclidean distance between current and new points increases. Reducing $s$ with the square root of the number of dimensions will counteract this effect.⁶ If serial dependence is a problem and independent variates are required, shuffling can help (see Section 6.5.3). In either case, one must ensure that all new candidates are valid solutions and remain within the support. Finally, to avoid problems with unfortunate starting points, it is advisable to allow the process to burn in and discard initial samples.

4. When speaking about Markov chains, one usually has the discrete version in mind; this, however, is just one special case.
5. As it turned out later, this concept is also useful in optimization; the Simulated Annealing heuristic is based on this algorithm; see Algorithm 45.
6. See also the remark on calibrating heuristics in footnote 4 on page 538.
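The trade-off between a small and a large step size can be made visible with a one-dimensional experiment. The following sketch assumes a standard normal target and follows the repeat-until-accepted loop of Algorithm 23; the names (s_vals, burnin, ac1, tries) and all values are illustrative assumptions. It records the lag-1 autocorrelation and the average number of attempts per new variate after discarding a burn-in phase.

```matlab
% effect of the step size s on a 1-D chain (target: standard normal)
% all names and values are illustrative assumptions
f      = @(x) exp(-x.^2/2);       % target density up to a constant
n      = 5000;                    % samples kept per chain
burnin = 500;                     % initial samples to discard
s_vals = [0.2 2];                 % small and large step size
ac1    = zeros(size(s_vals));     % lag-1 autocorrelation
tries  = zeros(size(s_vals));     % average attempts per new variate
for k = 1:numel(s_vals)
    s = s_vals(k);
    x = zeros(n+burnin,1);        % chain starts at x(1) = 0
    attempts = 0;
    for i = 1:(n+burnin-1)
        accepted = false;
        while ~accepted
            x_new    = x(i) + 2*(rand-.5)*s;
            attempts = attempts + 1;
            if rand < f(x_new)/f(x(i))
                x(i+1)   = x_new;
                accepted = true;
            end
        end
    end
    x        = x(burnin+1:end);   % discard burn-in
    C        = corrcoef(x(1:end-1),x(2:end));
    ac1(k)   = C(1,2);
    tries(k) = attempts/(n+burnin-1);
end
```

With the small step size, ac1 should come out close to 1 while tries stays near 1; the larger step size should lower the autocorrelation at the price of more attempts. Exact numbers differ from run to run.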

Like the acceptance—rejection method, the Metropolis algorithm can readily be applied to mul- 
tivariate distributions with known joint density. For an illustration, consider the bivariate normal 
distribution. 


Listing 7.22: C-ModelingDependencies/M/./Ch07/MetropolisMVNormal.m

function x = MetropolisMVNormal(n,d,s,rho)
% MetropolisMVNormal.m  -- version 2011-01-06
% generate n standard normal variates
%   n .... number of samples
%   d .... number of dimensions
%   s .... step size for new variate
%   rho .. correlation matrix (if scalar, all pairs have same corr.)

if isscalar(rho)
    R = eye(d)*(1-rho) + ones(d)*rho;
else
    R = rho;
end
mu = zeros(1,d);
x = nan(n,d);
x(1,:) = randn(1,d);
f_i = mvnpdf(x(1,:),mu,R);
for i = 1:(n-1)
    while isnan(x(i+1,:))    % repeat until a candidate is accepted
        x_new = x(i,:) + 2*(rand(1,d)-.5)*s;
        f_new = mvnpdf(x_new,mu,R);
        if rand < f_new / f_i
            x(i+1,:) = x_new;
            f_i = f_new;
        end
    end
end


Fig. 7.2 depicts the 2000 samples generated for a required correlation of 0.75. The left panel uses a step size of $s = 0.2$, the right one of $s = 1$. To show the movements during the process, $x_1$ is depicted in black, and subsequent points are increasingly brighter. As can be seen, a larger $s$ scatters the points more evenly over the probability space, yet it requires more run time.

FIGURE 7.2 Bivariate normal samples, generated with the Metropolis algorithm. Left panel: MetropolisMVNormal(2000,2,0.2,0.75) (s = 0.2). Right panel: MetropolisMVNormal(2000,2,1.0,0.75) (s = 1.0).


A generalized version of this is the Metropolis–Hastings algorithm (see Hastings, 1970). The idea is to simulate $x$ by again drawing candidate solutions and accepting them according to a probabilistic criterion, yet by using a majorizing distribution. Algorithm 24 provides some details.


Algorithm 24 Metropolis–Hastings.
1: initialize $x_1$ and $n$
2: for $i = 2 : n$ do
3:     $x_i = x_{i-1}$
4:     draw $y$ from density $g(y \mid x_i)$
5:     compute Hastings ratio $r = \dfrac{p_X(y)\, g(x_i \mid y)}{p_X(x_i)\, g(y \mid x_i)}$
6:     draw $u$ from (0, 1) uniform distribution
7:     if $r > u$ then $x_i = y$
8: end for


Hastings (1970) generalized this approach further. Again, a majorizing or candidate-generating density $g$ is required, whereas the actual target density is $p$. As for the previous example, let $x_i$ be the current solution and $y$ be a new sample in its proximity. The probability of acceptance depends on the Hastings ratio, $r = \dfrac{p_X(y)\, g(x_i \mid y)}{p_X(x_i)\, g(y \mid x_i)}$. The main advantage of the Metropolis–Hastings algorithm is that the shape of the target does not need to be known. If the conditional distribution is known but not the joint, the Gibbs sampler can be used (see Geman and Geman, 1984).
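To see the Hastings ratio at work, it helps to pick a candidate-generating density that is not symmetric. The following sketch is an illustration under our own assumptions, not the authors' implementation: the target is $p(x) \propto x^2 e^{-x}$ for $x > 0$ (a Gamma distribution with shape 3 and mean 3), and candidates come from a multiplicative lognormal proposal, $y = x\exp(sz)$ with $z \sim N(0,1)$. For this proposal, $g(x \mid y)/g(y \mid x) = y/x$, which is the correction term entering $r$.

```matlab
% Metropolis-Hastings with an asymmetric (multiplicative) proposal
% target: p(x) proportional to x^2*exp(-x), x > 0
% all names and values are illustrative assumptions
p = @(x) x.^2 .* exp(-x);      % target density up to a constant
s = 0.8;                       % scale of the proposal
n = 60000;                     % number of samples
x = zeros(n,1);
x(1) = 1;                      % starting value
for i = 2:n
    x(i) = x(i-1);                        % default: stay where we are
    y    = x(i-1) * exp(s*randn);         % candidate from lognormal proposal
    % Hastings ratio; for this proposal g(x|y)/g(y|x) = y/x
    r    = ( p(y) * y ) / ( p(x(i-1)) * x(i-1) );
    if r > rand
        x(i) = y;                         % accept the candidate
    end
end
x = x(5001:end);               % discard burn-in
m = mean(x);                   % should be close to the target mean of 3
```

Unlike Algorithm 23, a rejected candidate here means that the current value is repeated, exactly as in line 3 of Algorithm 24.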


7.3 Copula models 
7.3.1 Concepts 


The advantage of covariances and linear correlations as measures of dependence is that they are well 
understood and readily accessible. In many financial applications, however, covariances can—at 
best—only approximate the dependencies between two variables. Apart from the problems with 
linear correlation discussed in Section 7.1.1, the dispersion of the individual variables (e.g., returns) 
is rarely described by variance alone, and the marginal distributions (i.e., an individual variable’s 
distribution independent of another variable’s state) are rarely Gaussian. 

A more general approach to model dependencies (either between variates or over time) is the use of copulas. Copulas are functions that relate (univariate) marginal CDFs into a joint multivariate CDF. The most common type is bivariate copulas; according to Sklar's theorem, the joint bivariate CDF can then be expressed as a function (i.e., the copula, $C(\cdot, \cdot)$) of the two marginals:

$$F_{XY}(x, y) = C(F_X(x), F_Y(y)).$$

This implies that the conditional distribution of $Y$ can be expressed as

$$F_{Y|X}(y \mid x) = \frac{\partial C(F_X(x), F_Y(y))}{\partial F_X(x)}.$$


As discussed in Section 6.2, if $x$ has a CDF $F_X$, then $u = F_X(x)$ will be uniformly distributed; and with $F^{-1}$ being the inverse of $F$, $x$ can be "reconstructed" if $u$ is known:

$$F_X(x) = u \;\Rightarrow\; F_X^{-1}(u) = x, \quad \text{where } u \sim U(0, 1).$$

This means that it is enough to have copulas that accept uniform marginals, $C(u_1, u_2)$. For any other marginal, the variates can be transformed accordingly. Therefore, the copula can also be seen as the joint distribution of quantiles.
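The transformation between variates and quantiles takes only a few lines. In the sketch below, independent uniforms stand in for samples from a copula; the function handles (Finv_norm, Finv_exp, F_norm) are our own names, and erfc/erfcinv are used as toolbox-free equivalents of normcdf/norminv.

```matlab
% map uniforms to arbitrary marginals by inversion, and back
% (names are assumptions; erfc/erfcinv replace normcdf/norminv)
Finv_norm = @(u) -sqrt(2) * erfcinv(2*u);      % inverse standard normal CDF
Finv_exp  = @(u,lambda) -log(1-u) / lambda;    % inverse exponential CDF
F_norm    = @(x) 0.5 * erfc(-x/sqrt(2));       % standard normal CDF
u = rand(10000,2);                 % stand-in for samples from a copula
x = Finv_norm(u(:,1));             % first marginal: standard normal
y = Finv_exp(u(:,2),2);            % second marginal: exponential, mean 1/2
u_back = F_norm(x);                % F(x) recovers the quantiles
```

Whatever dependence exists between the columns of u carries over to x and y, because both transformations are strictly increasing.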

The simplest and, well, most trivial case is the independence copula: 


C(u, v) = uv, 
where $u$ and $v$ are independent uniform variates. Though not the most relevant example (after all, copulas are supposed to model dependence and not independence), it illustrates several important features: copulas model joint probabilities, and they are, therefore, defined within the unit square, or the $m$-dimensional unit cube if an $m$-dimensional copula is considered. Already closer to financial applications are Gaussian distributions. If the two marginals are normal, then a Gaussian copula links them into a bivariate normal distribution:

$$C_\rho(u, v) = \int_{-\infty}^{\Phi^{-1}(u)} \int_{-\infty}^{\Phi^{-1}(v)} \phi_\rho(t_1, t_2)\, dt_2\, dt_1 = \Phi_\rho\big(\Phi^{-1}(u), \Phi^{-1}(v)\big),$$

where $\phi_\rho(t_1, t_2)$ is the standard bivariate normal PDF with mean 0, standard deviation 1, and correlation $\rho$; $\Phi_\rho$ is the corresponding multivariate CDF; and $\Phi^{-1}$ is the inverse of the univariate standard normal CDF. But this is just one case, and there exist many alternatives. For financial applications, asymmetric variants are of particular interest since they can capture real-world tail dependencies: stocks tend to crash simultaneously, but extreme positive jumps rarely happen at the same time.

The empirical copula, on the other hand, requires no parametric assumption at all. Using the frequency function for a sample of $n$ observations, it can be computed for quantiles $u_1 = i/n$ and $u_2 = j/n$ as the number of pairs not exceeding the corresponding order statistics $x_{(i)}$ and $y_{(j)}$, respectively, relative to $n$:

$$\hat{C}\left(\frac{i}{n}, \frac{j}{n}\right) = \frac{1}{n} \sum_{k=1}^{n} \mathbb{1}\{x_k \le x_{(i)},\; y_k \le y_{(j)}\}.$$
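A direct, if not particularly fast, way to evaluate the empirical copula on the full grid is sketched below. The data are simulated for illustration only, and all names (xs, ys, C) are assumptions.

```matlab
% empirical copula on the grid (i/n, j/n) for a bivariate sample
% (illustrative data; all names are assumptions)
n  = 200;
x  = randn(n,1);
y  = 0.7*x + sqrt(1-0.7^2)*randn(n,1);   % dependent second variate
xs = sort(x);                            % order statistics x_(1) <= ... <= x_(n)
ys = sort(y);
C  = zeros(n,n);
for i = 1:n
    for j = 1:n
        % share of pairs with x_k <= x_(i) and y_k <= y_(j)
        C(i,j) = sum( x <= xs(i) & y <= ys(j) ) / n;
    end
end
```

The last row and column of C reproduce the uniform marginals, C(i,n) = i/n and C(n,j) = j/n, which makes for a quick sanity check.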
A special (and very important) class is Archimedean copulas. The main ingredient here is the Archimedean copula generator, which is a decreasing, convex function $\varphi(u)$ that takes arguments from the unit interval $u \in [0, 1]$ and produces positive values in $\mathbb{R}^+$. Archimedean copulas can be written as

$$C(u, v) = \varphi^{[-1]}\big(\varphi(u) + \varphi(v)\big).$$

Here, $\varphi^{[-1]}(z)$ is the pseudoinverse defined by

$$\varphi^{[-1]}(z) = \begin{cases} \varphi^{-1}(z) & \text{if } 0 \le z \le \varphi(0) \\ 0 & \text{if } z > \varphi(0), \end{cases}$$

which accepts any positive value $z$ and returns values within the range $[0, 1]$. For $\varphi(0) = \infty$, it is equivalent to the "ordinary" inverse, $\varphi^{-1}(z)$. A useful feature of Archimedean copulas is that they can easily be expanded to the multivariate case:

$$C(u_1, u_2, \ldots, u_m) = \varphi^{[-1]}\big(\varphi(u_1) + \varphi(u_2) + \cdots + \varphi(u_m)\big).$$

Also, Archimedean copulas are symmetric since, obviously, $C(u_1, u_2) = C(u_2, u_1)$.
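As a concrete instance, consider the Clayton copula with parameter $\theta > 0$, whose generator is $\varphi(u) = (u^{-\theta} - 1)/\theta$. Since $\varphi(0) = \infty$ in this case, the pseudoinverse coincides with the ordinary inverse, $\varphi^{-1}(z) = (1 + \theta z)^{-1/\theta}$. The following sketch builds the copula from the generator alone; the handle names are our own.

```matlab
% Clayton copula built from its Archimedean generator (theta > 0)
% (names are assumptions; phi(0) is infinite here, so the pseudoinverse
%  equals the ordinary inverse)
theta  = 2;
phi    = @(u) (u.^(-theta) - 1) / theta;     % generator
phiinv = @(z) (1 + theta*z).^(-1/theta);     % its inverse
C2     = @(u,v) phiinv(phi(u) + phi(v));     % bivariate copula
Cm     = @(U) phiinv(sum(phi(U),2));         % m-variate; one point per row
c_uv   = C2(0.3,0.6);
c_vu   = C2(0.6,0.3);                        % same value: symmetry
c_3    = Cm([0.3 0.6 0.9]);                  % three-dimensional case
```

Setting one argument to 1 returns the other one, e.g., C2(0.3,1) equals 0.3, because $\varphi(1) = 0$.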

To calibrate a copula model from empirical data, the canonical maximum likelihood (CML) method can be used. Assume a bivariate sample $(x, y)$ of $n$ observations. The joint distribution can then be represented by

$$F(x, y, \theta) = C(\hat{F}(x), \hat{F}(y), \theta),$$

where $\theta$ is the (set of) parameter(s) for the specific copula. $\hat{F}(x)$ and $\hat{F}(y)$ are the empirical marginals of $x$ and $y$ that are computed using their frequencies,

$$\hat{F}(z) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{z_i \le z\}}.$$

The copula parameter(s) can be estimated by maximizing the likelihood function

$$\hat{\theta} = \arg\max_{\theta} L(\theta)$$

with the likelihood function

$$L(\theta) = \sum_{i=1}^{n} \log\big(c(\hat{F}(x_i), \hat{F}(y_i), \theta)\big),$$

where $c$ is the copula's density. As Genest et al. (1995) showed, this estimator is consistent and asymptotically normally distributed. The goodness of fit can be assessed with the log-likelihood values and related information criteria, such as Akaike's (AIC) or the Schwarz–Bayesian (BIC); these will briefly be addressed on page 166. For those interested in the details, Embrechts et al. (2003) and Cherubini et al. (2004) provide good introductions to the subject with a focus on financial applications.
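The two steps of CML can be sketched for the Gaussian copula as follows. This is an illustration under our own assumptions: the data are simulated with a known correlation rho0 and non-normal marginals, the empirical marginals are scaled by $n + 1$ instead of $n$ (a common variant that keeps the quantiles strictly inside the unit interval), and erfcinv is used as a toolbox-free substitute for norminv.

```matlab
% canonical maximum likelihood for a Gaussian copula (illustrative sketch)
% all names and values are assumptions; no toolbox functions are used
n     = 2000;
rho0  = 0.6;                         % correlation used to create the data
z     = randn(n,2);
x     = exp(z(:,1));                 % lognormal marginal
y     = (rho0*z(:,1) + sqrt(1-rho0^2)*z(:,2)).^3;  % increasing transformation
% step 1: empirical marginals (ranks scaled by n+1 to stay inside (0,1))
[~,ix] = sort(x);  Fx = zeros(n,1);  Fx(ix) = (1:n)'/(n+1);
[~,iy] = sort(y);  Fy = zeros(n,1);  Fy(iy) = (1:n)'/(n+1);
% step 2: maximize the copula log-likelihood over the parameter
qx    = -sqrt(2)*erfcinv(2*Fx);      % normal quantiles of Fx
qy    = -sqrt(2)*erfcinv(2*Fy);
logc  = @(p) -0.5*log(1-p^2) + ...
        (2*p*qx.*qy - p^2*(qx.^2+qy.^2)) / (2*(1-p^2));
negL  = @(p) -sum(logc(p));
rho_hat = fminbnd(negL,-0.99,0.99);  % estimate of the copula parameter
```

Because the marginals enter only through their ranks, the estimate rho_hat should come out close to rho0 even though neither x nor y is normally distributed.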


7.3.2 Simulation using copulas 


Like the estimation process, the simulation process also comprises two main steps: first, the quan- 
tiles with the copula dependence are generated, and second, these quantiles are translated into 
samples according to the marginal distributions. The latter step is usually done with the inversion 
method. For sampling the quantiles, different approaches exist. 


Metropolis sampling 


When the density of a copula is available, samples can be drawn by using the acceptance-rejection method. However, one limitation here is that it is not always easy to find a suitable majorizing distribution to generate samples. The Metropolis algorithm would still work in that situation. Remember that in its original incarnation, a new candidate is generated by adding some uniform noise to the current value $x_i$. Here, however, the support is limited to $[0, 1]$, so the new solution can be generated as pure noise. This helps to reduce the serial dependence of the samples (though there will be some left since the density of $x_i$ is used). The downside of lower acceptance rates will often be compensated by the shorter burn-in times.

The following MATLAB code produces a matrix of n two-dimensional samples, based on the 
Gumbel, Frank, Clayton, or Gaussian copula. 


Listing 7.23: C-ModelingDependencies/M/./Ch07/CopulaSim.m

function [u,x,Finv,c] = CopulaSim(n,copula,p)
% CopulaSim.m  -- version 2011-01-06
%   n ........ number of samples
%   copula ... name of copula (Frank, Clayton, Gumbel or Gaussian)
%   p ........ parameter of the copula

% marginal
Finv = @(u) norminv(u);
% densities
switch upper(copula)
    case 'FRANK'    % parameter: p<>0
        c = @(u,v,p) p*(1-exp(-p)).*exp(-p*(u+v)) ./ ...
            (exp(-p*(u+v)) - exp(-p*u) - exp(-p*v) + exp(-p)).^2;
    case 'CLAYTON'  % parameter: p>= -1; p~=0
        c = @(u,v,p) (1+p) ./ ((u.*v).^(1+p) .* (u.^(-p) + ...
            v.^(-p)-1).^(2+1/p) );
    case 'GUMBEL'   % parameter: p >= 1
        % the factor (...)^(1/p) + p - 1 completes the Gumbel density
        c = @(u,v,p) ( (-log(u)).^(p-1) .* (-log(v)).^(p-1) .* ...
            ((-log(u)).^p + (-log(v)).^p).^(1/p-2) .* ...
            (((-log(u)).^p + (-log(v)).^p).^(1/p) + p - 1) ) ./ ...
            (u.*v.*exp( ((-log(u)).^p + (-log(v)).^p).^(1/p)));
    otherwise       % Gaussian; parameter: -1 < p < 1
        c = @(u,v,p) 1/sqrt(1-p^2) * exp((norminv(u).^2 + ...
            norminv(v).^2)/(2) + (2*p*norminv(u).*norminv(v) - ...
            norminv(u).^2-norminv(v).^2) / (2*(1-p^2)));
end
% Metropolis
u = NaN(n,2);
u(1,:) = rand(1,2);
f_i = c(u(1,1),u(1,2),p);
for i = 2:n
    while isnan(u(i,:))    % repeat until a candidate is accepted
        u_new = rand(1,2);
        f_new = c(u_new(1),u_new(2),p);
        if rand < f_new/f_i
            u(i,:) = u_new;
            f_i = f_new;
        end
    end
end
x = Finv(u);


To illustrate the different properties, Fig. 7.3 depicts contour plots of the copulas' densities and samples drawn from them. To see how this transforms into the distribution of the actual variates, we assume standard normal marginal distributions. The graphs in the right column of each panel show the resulting joint densities and samples. As can be seen, under a Gumbel copula, there is a stronger dependence on the positive end of the distribution, whereas for the Clayton, it is on the negative side. This last property coincides with stylized facts for empirical returns: assets tend to move in the same direction when markets crash but are less dependent in boom times. This makes the Clayton a popular choice for empirical applications.

FIGURE 7.3 Densities for u and x (under the assumption of Gaussian marginals) and 500 samples.

Copula models can be constructed in a “mix-and-match” fashion: any copula can be combined 
with any marginal. The Gaussian copula, for example, comes with one parameter that corresponds 
to linear correlation if marginals are normal, and its parameter is well understood. Combining the 
Gaussian copula with Gaussian marginals gives a fancy way of expressing multivariate normals. 
However, the Gaussian copula can also be combined with other marginals, and Gaussian marginals 
can be linked via any copula—or combinations of copulas (weighted combinations or nested ones). 
Also, there exist copulas with more than two dimensions, making it possible to link one variate 
to the simultaneous states of several others. This flexibility is probably the greatest virtue of this 
approach—but, in empirical applications, can also lead to overfitting. 


Direct sampling 


Another approach would be to draw the first variate, $v$, from the unconditional (0, 1) uniform distribution and the second variate from the conditional one. For this, we require the partial derivative of the copula. Let $C_u = \partial C(u, v)/\partial u$ be the partial derivative of $C(u, v)$ with respect to $u$ and $C_u^{-1}$ its inverse. Then two samples $x$ and $y$ with marginals $F_X$ and $F_Y$, respectively, can be generated as outlined in Algorithm 25. For more details on this, including some examples, see, e.g., Fusai and Roncoroni (2008).


Algorithm 25 Direct sampling.
1: draw two independent uniform samples $u_1, u_2 \sim U(0, 1)$
2: impose dependence $u_2^* = C_{u_1}^{-1}(u_2)$
3: map dependent uniform samples according to marginals $x = F_X^{-1}(u_1)$, $y = F_Y^{-1}(u_2^*)$
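For the Clayton copula with $\theta > 0$, the partial derivative can be inverted in closed form: solving $C_{u_1}(u_2^*) = u_2$ for $u_2^*$ gives $u_2^* = \big(u_1^{-\theta}(u_2^{-\theta/(1+\theta)} - 1) + 1\big)^{-1/\theta}$. The sketch below follows the three steps of direct sampling under this assumption; standard normal marginals, the parameter value, and all names are illustrative choices.

```matlab
% direct (conditional) sampling from a Clayton copula, theta > 0
% (illustrative sketch; names and parameter values are assumptions)
theta = 2;
n     = 20000;
u1    = rand(n,1);                   % step 1: independent uniforms
u2    = rand(n,1);
% step 2: invert the partial derivative of C with respect to u1
u2s   = ( u1.^(-theta) .* (u2.^(-theta/(1+theta)) - 1) + 1 ).^(-1/theta);
% step 3: map to the marginals (here: standard normal for both)
Finv  = @(u) -sqrt(2)*erfcinv(2*u);  % toolbox-free equivalent of norminv
x     = Finv(u1);
y     = Finv(u2s);
% quick checks of the dependence structure
R     = corrcoef(u1,u2s);
rho_u = R(1,2);                      % correlation of the quantiles
lower = mean(u1 < 0.05 & u2s < 0.05);   % joint lower-tail frequency
upper = mean(u1 > 0.95 & u2s > 0.95);   % joint upper-tail frequency
```

The joint lower-tail frequency should clearly exceed the upper-tail one, reflecting the asymmetric dependence of the Clayton copula.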
8.1 Setting the stage


Simulation can be generally defined as experiments using models. In finance, there are many ar- 
eas that use simulation techniques. These include the generation of artificial data when there are 
not enough real observations; the generation of scenarios; and the testing of assumptions, con- 
cepts, or strategies when real-world experiments are not advisable; to name just a few. Also, 
simulation can be useful when analytical solutions are not feasible; pricing and risk estimation 
problems are the usual suspects in this group. Finally, simulations can be used to get a better un- 
derstanding of the behavior and properties of systems, and to spot the need for adjustments and 
extensions. 

Traditionally, financial models come as econometric or mathematical sets of equations, 
describing deterministic and stochastic elements and how they relate. More recently, agent- 
based models for complex dynamic systems have gained importance; these focus on mi- 
crostructures and investigate how things behave on an aggregate level (e.g., prices). In partic- 
ular in complex systems, parts of these structures can change: agents can interact and influ- 
ence each other’s behavior, or their environment changes. Section 8.8 provides an example of 
this. 

In the main part of this chapter, though, we follow the more traditional route: relation- 
ships, dependencies, and developments can be captured by a quantitative model, the structure 
of which is known and does not change over time. Sometimes, all the ingredients are readily 
available: checking how the price of a bond reacts to an increase in the discount rate would be 
an example for this. In other situations, some of the variables come from historical or Monte 
Carlo simulation. Again, we have to limit ourselves, and it is the latter we will mainly address 
here. 
8.2 Single-period simulations
8.2.1 Terminal asset prices 


One of the many uses of Monte Carlo simulation is to develop an idea and intuitive understanding 
for how financial models work and what outcomes to actually expect. 

Consider the humble stock. One of the simplest (and most popular) assumptions is that prices are lognormally, and returns therefore normally, distributed. A nice feature of the normal distribution is that it only requires two parameters, mean and variance. The standard deviation (i.e., square root of the variance, or, in finance talk, the volatility) gives us an indication of how likely deviations from the expected value are: two standard deviations around the mean should cover about 19 out of 20 cases, i.e., approximately 95%. The exact probabilities for a confidence interval of $c$ standard deviations around the mean can be computed by $\text{prob}(X \in \mu \pm c\sigma) = N(c) - N(-c) = 2N(c) - 1$. In R, this can be done by

    2 * pnorm(c) - 1

for any prespecified value c. The MATLAB® equivalent would be

    2 * normcdf(c) - 1
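If the Statistics Toolbox is not at hand, the standard normal CDF can be written with the built-in erfc. The following lines are a small sketch along these lines (the handle name N is our own) and tabulate the coverage for one, two, and three standard deviations.

```matlab
% coverage of mu +/- c*sigma without toolbox functions
% (N is written with erfc here, as an equivalent of normcdf)
N    = @(c) 0.5 * erfc(-c/sqrt(2));   % standard normal CDF
c    = [1 2 3];
prob = 2*N(c) - 1;                    % approx. 0.6827  0.9545  0.9973
```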


To get a flavor of what this means in practice, let us assume that the yearly volatility of a stock is 20% and the drift of the stock is 5%; hence, $r \sim N(0.05, 0.2^2)$. To get an idea of what the stock could be worth at the end of the year, we can simulate 1000 sample returns

    r = randn(1000,1) * 0.2 + 0.05

and compute the corresponding stock prices

    S_T = S_0 * exp(r)

for given initial stock price S_0. For a sufficiently large sample (or by averaging repeated Monte Carlo simulations), one should find that the mean sample return converges to $\mu$ and the volatility actually does converge to $\sigma$. Histograms of the returns and terminal stock prices should exhibit the symmetry and positive skewness of the normal and lognormal distribution, respectively. The latter also causes the expected stock price to be above its median: since the normal distribution is symmetric, half of the returns should be above $\mu$, and the other half below; this corresponds to a stock price of $S_0 \exp(\mu)$. In the stock prices, however, the positive deviations will be bigger than the negative ones; this is a straightforward consequence of taking the $\exp(\cdot)$. Consequently, the mean will be shifted to the right, and the average of the simulated terminal stock prices will converge to their expected value $E(S_T) = S_0 \exp(\mu + \sigma^2/2)$. SimulateStk.m provides a short simulation of this case.


Listing 8.1: C-FinancialSimulations/M/./Ch08/SimulateStk.m

function SimStk = SimulateStk(mu,sigma,N_samples)
% SimulateStk.m  -- version 2011-01-10
%   mu, sigma ....: drift and volatility
%   N_samples ....: number of samples
SimStk.mu    = mu;
SimStk.sigma = sigma;
SimStk.r     = randn(N_samples,1)*sigma + mu;
SimStk.S     = exp(SimStk.r);        % terminal prices for S_0 = 1
SimStk.meanr = mean(SimStk.r);
SimStk.stdr  = std(SimStk.r);
SimStk.meanS = mean(SimStk.S);
SimStk.muS   = exp(mu+sigma^2/2);
SimStk.medS  = median(SimStk.S);

r = SimStk.r; S_T = SimStk.S;
% display results (simulated value first, theoretical value in parentheses)
fprintf('        mean: %4.2f (mu = %4.2f)\n',SimStk.meanr,mu);
fprintf('  volatility: %4.2f (sigma = %4.2f)\n',SimStk.stdr,sigma);
fprintf('  exp. price: %4.2f (E(S) = %4.2f)\n',SimStk.meanS, exp(mu+sigma^2/2));
fprintf('median price: %4.2f (M(S) = %4.2f)\n',SimStk.medS, exp(mu));
FIGURE 8.1 Simulated returns and stock prices for 100 (left) and 10,000 samples (right). White circles are theoretical quantiles and black ones are their empirical counterparts. Theoretical and empirical means are white and black stars, respectively.


The asymmetric confidence intervals also become apparent in the graphical output (Fig. 8.1 is an example): what is symmetric in the return histogram becomes asymmetric in the price histogram. Hovering above the median are the mean values. As can be seen, these are (typically) equal to the median for the returns, but shifted to the right for the prices.

The distribution of the samples will vary, so every run will produce slightly different sample 
mean returns and standard deviations. With increasing sample sizes, however, the Monte Carlo 
noise should be reduced. 


8.2.2 1-over-N portfolios 


If one takes this example one step further and is interested in a portfolio of assets rather than 
individual stocks, one basically has two alternatives: 


1. Analytically derive the properties of the portfolio, consider it as a special type of asset, and 
simulate it in the same fashion as any individual asset. The advantage of this direct simulation 
approach is that it keeps the number of variables down to a bare minimum. On the downside, it 
requires that a reliable formal description of the portfolio’s properties be possible. 

2. Simulate the individual stocks and combine the samples, like empirical data, into a portfolio. 
The advantage here is that this works even if the properties of the overall portfolio cannot be 
derived (or are too cumbersome to get) as long as one can get simulations of the constituents. 
This is particularly true when, for example, individual assets follow nonstandard distributions, 
have contingent payments, or are linked in a nonlinear fashion. Consider a portfolio that com- 
bines stocks, bonds, and derivatives in different currencies. The downside here is that each asset 
has to be simulated individually, which increases runtime and system requirements. 


For the sake of simplicity, we stick with stocks that have normally distributed returns, which all have the same drift $\mu$ and volatility $\sigma$, and they all have the same pairwise linear correlation $\rho$. The latter implies that they all have the same variances $\sigma^2$ and the same pairwise covariances $\sigma_{ij} = \rho_{ij}\sigma_i\sigma_j = \rho\sigma^2 \;\forall i \ne j$. The covariance matrix therefore has $\sigma^2$ along the diagonal and every off-diagonal element is $\sigma^2\rho$.

If we combine different assets into portfolios, then some of the individual variations will cancel out, and the portfolios' variances should be below the average of the individual assets' variances. A portfolio's variance is the weighted sum of the covariances of all its constituents. To make things even simpler, assume that all of the $N$ assets in the portfolio have the same weight $w_i = 1/N$ because the investor follows a "1-over-N" strategy. The portfolio's statistics for increasingly many different assets converge as follows:

$$\mu_P = \sum_i w_i \mu_i = \frac{1}{N}\sum_i \mu_i = \bar{\mu},$$

$$\sigma_P^2 = \sum_i \sum_j w_i w_j \sigma_{ij} = \frac{1}{N}\,\bar{\sigma}^2 + \left(1 - \frac{1}{N}\right)\bar{\sigma}_{ij} \;\longrightarrow\; \bar{\sigma}_{ij} \quad (N \to \infty),$$

where $\bar{\sigma}^2$ and $\bar{\sigma}_{ij}$ are the average variance and covariance (for pairs $i \ne j$), respectively.

In this case, the direct simulation approach would be straightforward. The mean and volatility of the portfolio returns are given as part of the above derivation of the limits, and the simulation would be like for a single stock with unchanged drift ($\mu_P(N) = \mu$), yet lowered volatility ($\sigma_P(N) = \sigma\sqrt{1/N + (1 - 1/N)\rho}$).
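A few lines suffice to see how quickly the diversification effect levels off. The sketch below evaluates $\sigma_P(N) = \sigma\sqrt{1/N + (1 - 1/N)\rho}$ for a growing number of assets; the values for sigma and rho are illustrative assumptions.

```matlab
% volatility of an equally weighted portfolio of N identical stocks
% (sigma and rho are illustrative values, not taken from the text)
sigma   = 0.20;
rho     = 0.30;
N       = [1 2 5 10 100 1000];
sigma_P = sigma * sqrt(1./N + (1 - 1./N)*rho);
limit   = sigma * sqrt(rho);     % N -> infinity: only the covariance remains
```

With these values, most of the reduction is achieved with the first handful of assets, and sigma_P approaches but never falls below the limit of roughly 0.11.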

The code Simulate1overN.m illustrates the indirect simulation approach. First, the individual assets are simulated and then combined into a portfolio. The larger the number of different assets, the closer the portfolio's signatures will be to their theoretical values, and this convergence process will be smoother when more samples are used for this simulation. In the case of the mean portfolio return, this will also show in the changing scale of the y-axis, which, at first glance, might lead to the optical illusion that there isn't a lot of change. The boxplots give a better overall impression: increasing the number of different assets, $N$, reduces the portfolio's variance (ultimately) to the average covariance, and the bandwidth of portfolio returns becomes smaller.


Listing 8.2: C-FinancialSimulations/M/./Ch08/Simulate1overN.m

function r_Pf = Simulate1overN(mu,sigma,rho,N_samples,N_stocks)
% Simulate1overN.m  -- version 2011-01-06
%   mu, sigma ...: drift and volatility (same for all stocks)
%   rho .........: linear correlation
%   N_samples ...: number of samples
%   N_stocks ....: maximum number of stocks

CovMat = eye(N_stocks) * sigma^2 + ...
         (ones(N_stocks) - eye(N_stocks)) * sigma^2 * rho;
e = randn(N_samples,N_stocks) * chol(CovMat);
r = mu + e;

% compute mean return for equally weighted portfolio
% of first 1 ... N_stocks stocks
r_Pf = NaN(N_samples,N_stocks);
for i = 1:N_stocks
    w = ones(i,1)/i;
    r_Pf(:,i) = r(:,1:i) * w;
end


The results for the 1-over-N portfolios also highlight another point (Fig. 8.2): despite diversification, realized values can differ substantially from the expected ones. What is plainly obvious from a statistician's point of view is sometimes neglected in portfolio evaluation. Volatility is a measure of likely deviations from the expected values. If an asset has normally distributed returns with an expected value of $\mu$ and a volatility of $\sigma$, then there's roughly a two-thirds chance that the realized (out of sample) return will be within the range of $\mu \pm \sigma$ and a 95% chance that it will be within $\mu \pm 2\sigma$ (see footnote 10 on page 170). In other words, in one out of three cases, the actual return will deviate by more than the volatility; in one out of 20 cases, it will deviate by more than twice the volatility from what was expected. Likewise, the realized volatility can differ from what was expected, just by chance.

In financial management, this has two main consequences: first, evaluating the performance of a portfolio ex post is difficult because unsuitable portfolio compositions and unlucky market events are hard to distinguish, as is luck from skill. Second, estimations based on past observations are subject to sampling errors, even when the processes are stationary. We will revisit this problem in Chapter 14.

FIGURE 8.2 Returns for portfolios consisting of N equally weighted assets ("1-over-N").


8.2.3 European options 


On several occasions in this book, European style options are used as simple examples for derivatives. Call (put) options allow the buyer (holder) of the option to buy (sell) the underlying asset at the exercise or strike price $X$. Obviously, the buyer will exercise only if the option has a positive inner value, that is, if the strike is below the underlying's price for a call or above it for a put. If the option is of European style, this right can be exercised at one specific date only (exercise day or maturity, $T$), whereas American style options can be exercised anytime until maturity.

Landmark models for option and derivative pricing are the ones suggested by Black and Scholes (1973) and Merton (1973). Their framework assumes, among other things, that the underlying's price follows a geometric Brownian motion with constant volatility and drift, does not pay dividends,¹ that continuous and frictionless trading is possible, and that trading will not affect prices. Their main idea is that underlying and derivative can be combined so that the combination is risk free, hence a suitable portfolio should earn exactly the safe return: nothing more, nothing less. In other words, it is enough to consider a risk-neutral investor since any risk can be diversified away. In this book, we use the Black–Scholes–Merton framework as a simple setting to demonstrate different numerical methods (see, e.g., Sections 4.3 and 17.1). Here, we will try to use Monte Carlo simulation to demonstrate the basic workings. For convenience, we repeat the main Black–Scholes pricing formula for a European call:

$$c_0 = \underbrace{S_0 N(d_1)}_{①} - \underbrace{X e^{-r_f T} N(d_2)}_{②} \qquad (8.1)$$

with current price of the underlying $S_0$ and $N(\cdot)$ being the CDF of the standard normal distribution and parameters

$$d_1 = \frac{\log(S_0/X) + (r_f + \sigma^2/2)\,T}{\sigma\sqrt{T}} \quad \text{and} \quad d_2 = d_1 - \sigma\sqrt{T}.$$

The inner value of a European call at maturity is

$$c_T = \max(S_T - X, 0),$$

where $S_T$ is the price of the underlying at maturity $T$ and $X$ is the strike. In a risk-neutral world, both the underlying and the option can be assumed to earn the safe return; the present value of the option can then be found by discounting the expected payoff with the risk-free return, $r_f$, while the underlying can be simulated so that the expected price is the present price plus safe interest²:

$$E^f(S_T) = S_0 \exp(r_f T), \qquad \log(S_T/S_0) \sim N\Big(\big(r_f - \tfrac{\sigma^2}{2}\big)T,\; \sigma^2 T\Big),$$

$$E^f(c_T) = E^f(\max(S_T - X, 0)), \qquad c_0 = E^f(c_T)\exp(-r_f T),$$

where $\sigma$ is the underlying's volatility. To ease presentation, we drop the reference to risk-neutral settings and assumptions in the following discussion.

The main steps for Monte Carlo pricing of a European call option are collected in Algorithm 26.

1. Subsequent models include discrete dividend payments or continuous dividend yields; Hull (2008) provides a comprehensive survey of pricing models.
The main steps for Monte Carlo pricing of a European call option are collected in Algorithm 26. 


Algorithm 26 Monte Carlo simulation for a European call option.
1: simulate underlying's returns $r_T$ with volatility $\sigma\sqrt{T}$ and mean $(r_f - \sigma^2/2)\,T$
2: compute underlying's price samples, $S_T = S_0 \exp(r_T)$
3: compute call's (inner) values at maturity, $c_T = \max(S_T - X, 0)$
4: compute present value of average call value, $c_0 = E(c_T)\exp(-r_f T)$


Term ① in the Black–Scholes solution (8.1) relates to the expected value of the underlying, and term ② relates to the expected payments of the strike. Either term can be found in the Monte Carlo simulation.


• The inner value of the option is positive (i.e., the option is exercised) whenever $S_T > X$. The likelihood of the option being exercised is therefore the number of samples with positive inner value in relation to the overall number of samples. In MATLAB, a logical check provides a Boolean value coded as 0 (false) or 1 (true); it is therefore enough to simply take the average of the vector of Booleans: if ST is the vector with the underlying's price samples and X is the strike, then ex = ST>X is a vector of Booleans about exercise yes/no, and mean(ex) provides the probability of exercising for the sample. In the Black–Scholes model, this likelihood is $N(d_2)$ and, with a sufficiently large sample, the Monte Carlo simulation should come close to this theoretical value. $X$ is the strike being paid in case the option is exercised; $X N(d_2)$ is, therefore, the expected payment, and $X e^{-r_f T} N(d_2)$ is its present value, which is term ② in Eq. (8.1).

• If ex is the vector with Booleans of exercising, then ST(ex) filters the sample prices above the strike, and mean(ST(ex)) is the conditional price of the underlying in the case of exercise; as we already know, the likelihood for this case is mean(ex). If the option is not exercised, the buyer does not get the underlying, and this position is worth zero. (But then, he or she does not have to pay the strike either.) All in all, the expected future value of this position is, therefore, mean(ST(ex)) * mean(ex). Discounting with the risk-free return provides the present value: mean(ST(ex)) * mean(ex) * exp(-rf*T). This corresponds to term ① in Eq. (8.1).


The following MATLAB code provides an implementation; the output contrasts theoretical to 
simulated values. Section 9.3 presents further Monte Carlo techniques for option pricing. 


2. In a more realistic, risk-averse world, the risk-adjusted interest rates would have to be used; yet under the assumed perfect risk
diversification, risk premia ought to cancel each other out, and the investor is left with the risk-free return.


Listing 8.3: C-FinancialSimulations/M/./Ch08/EuropeanOptionSimulation.m


% EuropeanOptionSimulation.m -- version 2011-01-06

% parameters
S0    = 100;
X     = 100;
rf    = .05;
sigma = .2;
T     = 1;
ns    = 100000; % number of Monte Carlo samples

% Black-Scholes price for European call
d1 = (log(S0/X)+(rf+sigma^2/2)*T) / (sigma * sqrt(T));
d2 = d1 - sigma*sqrt(T);
c0BS = S0*normcdf(d1) - X*exp(-rf*T) * normcdf(d2);

% MC Simulation
rs   = randn(ns,1)*sigma*sqrt(T) + (rf - sigma^2/2)*T;
STs  = S0*exp(rs);
cTs  = max(STs-X,0);
c0MC = mean(cTs)*exp(-rf*T);

% "ingredients"
ex = STs>X; % Boolean: exercise yes/no

fprintf('Simulation Results: \n======================\n');
fprintf('E(stock price): %8.4f (theor.: %8.3f)\n', ...
    [mean(STs) S0*exp(rf*T)]);
fprintf('prob. exercise: %8.4f (theor.: %8.3f)\n', ...
    [mean(ex) normcdf(d2)]);
fprintf('PV(E(paymt X)): %8.4f (theor.: %8.3f)\n', ...
    [mean(ex)*X*exp(-rf*T) X*exp(-rf*T) * normcdf(d2)]);
fprintf('PV(E(paymt S)): %8.4f (theor.: %8.3f)\n', ...
    [mean(ex)*mean(STs(ex))*exp(-rf*T) S0*normcdf(d1)]);
fprintf('======================\ncall price:     %8.4f (theor.: %8.3f)\n', ...
    [c0MC c0BS]);
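How close the simulated price can be expected to come to the theoretical one depends on the number of samples. As a rough guide, the sketch below adds the standard error of the Monte Carlo estimate and an approximate 95% interval. The parameter values follow the listing, with $T = 1$ taken as an assumption; the helper Ncdf is a hypothetical stand-in for normcdf based on erfc.

```matlab
% Sketch: standard error and 95% interval of a Monte Carlo call price.
% Parameters mirror the listing above; T = 1 is an assumption.
S0 = 100; X = 100; rf = 0.05; sigma = 0.2; T = 1; ns = 100000;

% standard normal CDF via erfc (hypothetical stand-in for normcdf)
Ncdf = @(x) 0.5*erfc(-x/sqrt(2));

% Black-Scholes benchmark
d1   = (log(S0/X) + (rf + sigma^2/2)*T) / (sigma*sqrt(T));
d2   = d1 - sigma*sqrt(T);
c0BS = S0*Ncdf(d1) - X*exp(-rf*T)*Ncdf(d2);

% discounted payoff samples
STs  = S0*exp((rf - sigma^2/2)*T + sigma*sqrt(T)*randn(ns,1));
pv   = exp(-rf*T) * max(STs - X, 0);

c0MC = mean(pv);
se   = std(pv)/sqrt(ns);               % standard error of the sample mean
ci   = c0MC + [-1.96 1.96]*se;         % approximate 95% confidence interval

fprintf('MC %7.4f (s.e. %6.4f), 95%% CI [%7.4f, %7.4f], BS %7.4f\n', ...
        c0MC, se, ci, c0BS);
```

Since the standard error shrinks with the square root of the sample size, halving the interval width requires roughly four times as many samples.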


8.2.4 VaR of a covered put portfolio 


As mentioned, simulating and aggregating a portfolio’s constituents is particularly useful when
analytical aggregation is challenging. To illustrate this, let’s again assume an asset with normally
distributed returns, but now it is combined with a put option. Again to keep things simple, we
assume that it is a European put and that no intermediate trading is going on. If the strike for the
put is $X$, then the put will be exercised if and only if the underlying’s price is lower than the strike
($S_T < X$); in this case, the investor gets a price that is $(X - S_T)$ above the stock’s then market
price. If the stock price exceeds the strike ($S_T > X$), selling the stock directly will provide a higher
price, and the put expires without exercise. What happens in the unlikely event of $S_T = X$ does
not make a difference; it seems reasonable to let the put expire: exercising does not yield a price
higher than the market price but causes extra effort. In the presence of market frictions such as
transaction costs, pin risk,³ and opportunity costs of actually having to act, the limits between
exercising or not could be shifted. In short, the value of the put at maturity is piecewise linear,
$\max(X - S_T, 0)$. Clearly, the put’s payoff at maturity will not be lognormally distributed like the
asset. Hence, simulating the value of a portfolio consisting of stocks and puts directly could be
cumbersome. This is even more true for any point in time before maturity, $\tau < T$, where the put’s
value is a nonlinear convex function with respect to the underlying’s price $S_\tau$. In this case, it is
much simpler to first simulate the individual constituents and then combine them according to the
portfolio’s composition.
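The situation at maturity can be sketched in a few lines: simulate the stock, apply the piecewise linear payoff, and add the two. The parameter values below are illustrative assumptions; the point is that the combined position has a probability mass exactly at the strike, which a lognormal distribution cannot have.

```matlab
% Sketch: payoff of stock plus put at maturity (illustrative parameters).
S0 = 100; X = 100; mu = 0.1; sigma = 0.2; T = 0.5; ns = 100000;

ST = S0*exp(mu*T + sigma*sqrt(T)*randn(ns,1));  % lognormal stock prices
pT = max(X - ST, 0);                            % put payoff, piecewise linear
PF = ST + pT;                                   % stock plus one put

shareExercised = mean(ST < X);                  % put is exercised only if ST < X
shareAtFloor   = mean(abs(PF - X) < 1e-9);      % portfolio value sits at the strike

fprintf('exercised: %5.3f | portfolio at floor X: %5.3f | min(PF) = %8.4f\n', ...
        shareExercised, shareAtFloor, min(PF));
```

Every sample in which the put is exercised ends up at the floor $X$, so the two shares coincide.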

To illustrate this, assume an investor owns a stock but wants protection against losses at some
point $T$ in the future. To do this, he or she buys one European put option per stock with an exercise
price equal to the current stock price. The portfolio will be perfectly protected against losses
at time $T$, but not in the intermediate time: at any point $\tau < T$, the absolute Delta of the put
is always less than one, $-1 < \Delta_{p,\tau} < 0$. In plain English, this means that a one-dollar drop in the
stock price will increase the put price, but by less than one dollar.⁴ Intuitively, this is because immediate
exercise at time $\tau$ is not possible, but there is a chance the stock price will recover again until
maturity.


3. If there is a temporal gap between receiving the underlier and the earliest time to sell it (e.g., if delivered on Friday
evening, it cannot be sold before Monday morning), there is a residual risk. When the option is only slightly in the money,
the buyer could then opt not to exercise.

4. The Greeks are discussed in more detail in Sections 5.5 and 9.3.3.


Let us assume that the stock has an initial price $S_0 = 100$ and the annual log return is normally
distributed with $r \sim N(0.1, 0.2^2)$, whereas the annual safe return is $r_f = 0.05$. The put has
time to maturity of $T = 0.5$ and strike $X = S_0 = 100$, and, for simplicity, the Black-Scholes
model provides a fair price for it. To simulate the portfolio wealth at $0 < \tau < T$, one needs to
first simulate the stock prices at time $\tau$. The log returns until then are normally distributed with
$r_\tau \sim N\big(0.1\,\tau,\ (0.2\sqrt{\tau})^2\big)$.

At time $\tau$, the remaining time to maturity for the put is $T - \tau$. Note that the option is priced
under risk neutrality and continuous trading; the Black-Scholes model therefore requires the safe
return, not the stock’s drift.


Listing 8.4: C-FinancialSimulations/M/./Ch08/SimulateStauPtau.m


function [PF_tau,S_tau,p_tau] = SimulateStauPtau(S_0,drift,vol,X,rSafe,T,tau,NSample)
% SimulateStauPtau.m -- version 2011-01-06
% simulate portfolio with 1 stock and 1 put at time tau
%   S_0      initial stock price with return ~ N(drift,vol^2)
%   X        strike of put
%   rSafe    riskfree rate of return
%   T        time to maturity of put
%   tau      point of valuation; if not provided: tau = T
%   NSample  number of samples for the simulation
if nargin < 7, tau = T; else, tau = min(T,tau); end
if nargin < 8, NSample = 10000; end
% -- simulate stock prices at tau
r = randn(NSample,1)*(vol*sqrt(tau)) + drift * tau;
S_tau = S_0 * exp(r);
% -- compute put prices at tau; time left to maturity (T-tau)
T_left = T - tau;
p_tau = BSput(S_tau,X,rSafe,0,T_left,vol);
% -- value of portfolio at time tau
PF_tau = S_tau + p_tau;

SimulateStauPtau.m performs exactly this simulation.⁵ It also produces histograms of
the simulated stock prices and portfolio wealth for the selected time $\tau$. When experimenting with
different $\tau$s, it becomes apparent that in the beginning ($\tau$ close to 0) the portfolio wealth is more
or less the stock’s distribution, shifted to the right by the price of the option. The closer $\tau$ gets to
$T$, however, the more the portfolio’s distribution converges to a truncated version of the stock’s
with extra density at $X$. This spike represents all the cases where the put is exercised. Fig. 8.3 gives
examples for various values of $\tau$.

Under given assumptions, the value of the put at time $t = 0$ is $p_0 = 4.42$; the combination of 1
stock plus 1 put is therefore initially worth $P_0 = S_0 + p_0 = 104.42$. Over time, the portfolio value
$P_\tau$ will diffuse. At maturity, its lowest possible value will be $P_T = X$. For the time in between,
however, the portfolio value is $P_\tau = S_\tau + p_\tau(S_\tau)$, which can be below $X$. The Value-at-Risk,
$\mathrm{VaR}_{\alpha,\tau}$, is the loss at time $\tau$ exceeded only with probability $\alpha$.⁶ For our purposes, it can be
computed by taking the $\alpha$ quantile of the simulated portfolio wealth and subtracting it from the initial
value of the portfolio; this equals the $(1 - \alpha)$ quantile of the simulated losses. As can be seen from
Fig. 8.4, the portfolio’s VaR is substantially lower than that of the unprotected stock. But then so is
the portfolio’s return: the portfolio’s median log returns are less than the stock’s, and confidence bands
are narrower around this lower median as well.
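The computation can be sketched as follows for a single valuation date. The values of tau, alpha and the sample size are illustrative assumptions, and bsPut is a hypothetical stand-in (no dividends) for the BSput function, which is not reproduced here.

```matlab
% Sketch: VaR of stock vs. stock-plus-put at time tau (values from the example).
% bsPut is a hypothetical stand-in for BSput (Chapter 4), without dividends.
S0 = 100; drift = 0.1; vol = 0.2; X = 100; rSafe = 0.05; T = 0.5;
tau = 0.25; alpha = 0.05; ns = 100000;

Ncdf  = @(x) 0.5*erfc(-x/sqrt(2));                        % standard normal CDF
d1    = @(S,TT) (log(S./X) + (rSafe + vol^2/2)*TT) ./ (vol*sqrt(TT));
bsPut = @(S,TT) X*exp(-rSafe*TT).*Ncdf(-(d1(S,TT) - vol*sqrt(TT))) ...
                - S.*Ncdf(-d1(S,TT));

p0 = bsPut(S0, T);            % initial put price
P0 = S0 + p0;                 % initial portfolio value

% simulate stock and portfolio at tau (real-world drift for the stock,
% safe return inside the put price)
S_tau  = S0*exp(drift*tau + vol*sqrt(tau)*randn(ns,1));
PF_tau = S_tau + bsPut(S_tau, T - tau);

% empirical alpha quantile: k-th smallest sample with k = ceil(alpha*ns)
k  = ceil(alpha*ns);
Ss = sort(S_tau);  Ps = sort(PF_tau);
VaR_stock = S0 - Ss(k);
VaR_pf    = P0 - Ps(k);

fprintf('p0 = %6.3f | VaR stock = %6.3f | VaR portfolio = %6.3f\n', ...
        p0, VaR_stock, VaR_pf);
```

Sorting and indexing is used instead of a quantile function to keep the sketch free of toolbox dependencies; other quantile conventions will give slightly different numbers.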


5. The function BSput was introduced in Chapter 4 and computes the Black-Scholes price for a put.
6. Sometimes, $\alpha$ denotes the probability that the VaR threshold is not exceeded; the shortfall probability is then, obviously,
$1 - \alpha$.

FIGURE 8.3 Histograms for the value of a stock (top panel) and a portfolio consisting of the stock plus one European put
(bottom panel) at times $\tau$.

FIGURE 8.4 Left panel: Value-at-Risk for stock (dotted lines) and portfolio (solid line) for 1%, 5%, and 10% quantiles
(light to dark). Right panel: 5%, 25%, 50%, 75%, and 95% percentiles of log returns for stock (dotted) and portfolio (solid
line).


8.3 Simple price processes 


When making assumptions about financial price processes, the term “random walk” usually comes 
up very quickly. The argument behind this is actually quite simple and intuitively appealing: in a 
perfect world, all publicly available information should be reflected in the current price; hence any 
price change should be unpredictable or, in other words, random. If this were not the case and price 
movements were, at least to some extent, predictable, then this could allow for arbitrage or at least 
statistical arbitrage.⁷ With sufficiently many informed investors participating, this should not be
possible, and their activities should cause markets to become semi-efficient. 

From a modeling perspective, the price process can then be split into a deterministic and a 
stochastic part. From a theoretical point of view, martingales are the simplest building block for 
efficient markets: 


• If today’s price is the best estimate for tomorrow’s, then we speak of a martingale. Let $\Omega_t$ denote
the information available at time $t$. Then, for log prices $s_t = \log(S_t)$,

$$E_t(s_{t+1} \mid \Omega_t) = s_t.$$

In particular for short time steps, this is often a convenient and sufficiently accurate assumption.

• If the expected values increase (decrease) over time, they are a sub-martingale (super-martingale):

$$E_t(s_{t+1} \mid \Omega_t) \ge (\le)\ s_t.$$

• When log prices are a martingale, then any price change comes as a surprise, and nonoverlapping
price changes are uncorrelated at all leads and lags. In other words, the changes in the log prices
(i.e., the log returns) are a fair game:

$$E_t(\Delta s_{t+1} \mid \Omega_t) = E_t(r_{t+1} \mid \Omega_t) = 0.$$

• Finally, all market participants $p$ are assumed to use the same information and probability density
function:

$$f^p(r_{t+1} \mid \Omega_t^p) = f(r_{t+1} \mid \Omega_t).$$


7. In short, arbitrage exists if a self-financing position can produce a positive net payoff, but will never require a negative
net cash flow. In statistical arbitrage, the probabilities for positive and negative net payments should converge to one and
zero, respectively.
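The fair-game property has two sample counterparts that are easy to check: the average return should be close to zero, and consecutive returns should be uncorrelated. A minimal sketch, in which sigma, the sample size and the seed are illustrative assumptions:

```matlab
% Sketch: fair-game log returns (illustrative sigma and sample size).
randn('state', 1);                 % fix the seed (legacy syntax, as in this chapter)
T = 100000; sigma = 0.01;

r = sigma*randn(T,1);              % fair-game log returns, E(r) = 0

% sample analogue of E(r_{t+1} | information at t) = 0
meanR = mean(r);

% correlation between consecutive, nonoverlapping price changes
rd   = r - meanR;
rho1 = sum(rd(1:end-1) .* rd(2:end)) / sum(rd.^2);

fprintf('mean return %9.2e | lag-1 correlation %7.4f\n', meanR, rho1);
```

Both statistics fluctuate around zero with a standard error that shrinks with the square root of the sample size.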


In particular when only short periods of time are considered, martingales are often a convenient
working assumption. For longer periods, however, investors want to earn some risk premium, and
the above considerations hold for excess returns. The Brownian motion assumes that the new log
price is the previous plus a (deterministic) drift, $\mu$, plus a (stochastic) residual, $\varepsilon_t$:


$$s_{t+1} = s_t + \underbrace{\mu + \varepsilon_t}_{=\,r_t}, \qquad \text{where } \varepsilon_t \sim N(0, \sigma^2).$$


For the residuals, the previous assumption should hold. In particular, there should be no serial
correlation at any lead or lag, and their expected values should be zero. By definition, the stock
price itself then follows a geometric Brownian motion (GBM). For steps of arbitrary length $\tau$, this
reads as


$$S_{t+\tau} = S_t \exp(r_\tau) \quad \text{with } r_\tau \sim N(\mu\tau, \sigma^2\tau).$$


A convenient way of separating the stochastic and deterministic part of this process is to introduce
a Wiener process, $\{z_t\}$. This is a (continuous) Brownian motion with $z_0 = 0$ and
$z_{t+1} - z_t \sim N(0, 1)$. In the discretized version, the model for the prices is therefore


$$\log(S_{t+\Delta t}) = \log(S_t) + \mu\Delta t + \sigma\sqrt{\Delta t}\, z_t$$
$$S_{t+\Delta t} = S_t \exp\big(\mu\Delta t + \sigma\sqrt{\Delta t}\, z_t\big).$$


When calibrating and testing this model for a given data set, one estimates the parameters $\mu$ and
$\sigma$ and tests whether $z_t$ really is white noise. When simulating data, one takes exactly the opposite
route: one simulates the noise with the desired properties and plugs it into the parameterized model.
So for the GBM, the estimation and simulation procedures for a time series S and unit time
steps are as follows:


Estimation:                  Simulation:
s     = log(S);              z = randn(T,1);
r     = diff(s);             r = z * sigma + mu;
mu    = mean(r);             s = cumsum([log(S_0); r]);
sigma = std(r);              S = exp(s);
z     = (r-mu)/sigma;
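Running the two columns back to back gives a simple sanity check: simulate prices with known parameters, then estimate the parameters from the simulated prices. In the sketch below, the parameter values and the seed are illustrative assumptions.

```matlab
% Sketch: simulate a GBM with known parameters, then estimate them back.
% mu, sigma, S_0 and T are illustrative assumptions (unit time steps).
randn('state', 2);
T = 50000; mu = 0.0004; sigma = 0.01; S_0 = 100;

% simulation: noise -> returns -> log prices -> prices
z = randn(T,1);
r = z*sigma + mu;
s = cumsum([log(S_0); r]);
S = exp(s);                      % T+1 prices, including S_0

% estimation: prices -> returns -> parameters -> standardized noise
rHat     = diff(log(S));
muHat    = mean(rHat);
sigmaHat = std(rHat);
zHat     = (rHat - muHat)/sigmaHat;

fprintf('mu: %8.5f (true %8.5f) | sigma: %8.5f (true %8.5f)\n', ...
        muHat, mu, sigmaHat, sigma);
```

The estimate of sigma is typically much more precise than that of mu, whose standard error is sigma/sqrt(T).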


For all remaining models in this chapter, the workings will be similar: generate the residuals by 
drawing random samples from a specified distribution, and plug them into the model (together with 
realizations of explanatory variables, where applicable). 

Similar to the general principles in Section 6.6.4, variance reduction techniques can be employed
to reduce the Monte Carlo noise and promote fast convergence. Section 9.3 will provide
examples as well as caveats.


8.4 Processes with memory in the levels of returns


8.4.1 Efficient versus adaptive markets


The efficient market hypothesis states that new information is (i) immediately, and (ii) completely 
absorbed. A fundamental consequence of this is that returns should not exhibit serial dependencies. 
In other words, knowing yesterday’s return should not allow superior predictions. 

In the real world, however, this might not be the case. In this section, we will distinguish three 
different scenarios: 


• New information will impact not only today’s, but also the subsequent days’ returns. In other
words, the return today will be affected not only by new information, but also by information that
arrived previously. As a consequence, a (strong) piece of information will affect the performance
over a sequence of days. This property can be modeled with moving average (MA) models and
will be presented in Section 8.4.2.

• When investors consider previous returns as valuable information, then previous returns will
directly influence today’s performance. In this case, the return will depend on new information
and on preceding returns. This can be captured with autoregressive (AR) models where lagged
realizations of the dependent variable become explanatory variables. Section 8.4.3 discusses the
basic principle for this type of model, and Section 8.4.4 links it to moving average models.

• Finally, adaptive and changing beliefs can be the reason that one (main) signal can have an
impact over a longer stretch of time. This property can be captured in learning models, and some
further examples will be presented in Section 8.6, dealing with adaptive markets where new
information is absorbed only gradually.


In each of these cases, the returns will no longer be normally identically independently distributed
(n.i.i.d.) even when the arriving information is. The symptoms can be quite diverse: returns
can start exhibiting temporal patterns, and their means seem to move over time; autocorrelation can
emerge; and volatility can become time varying.


8.4.2 Moving averages 


When a shock (i.e., the residual) has a noticeable “echo” in the subsequent period, this can be
modeled by


$$r_t = \mu + \theta_1 e_{t-1} + e_t.$$


If a previous positive shock, $e_{t-1}$, has a positive effect today, $\theta_1 > 0$, then the expected value for $r_t$
will go up; negative shocks will have the opposite effect. In real markets, this could be due to some
trend followed by market participants. A negative $\theta_1$ will lower expectations at $t$ after a positive
shock; the real-life situation could be a correction to a previous overreaction. In either case, any
shock will have an instant and a delayed effect, causing the average of $r_t$ to move. More generally,
a $q$th order moving average (MA) model, MA($q$), can be denoted as


$$r_t = \mu + \sum_{\ell=1}^{q} \theta_\ell e_{t-\ell} + e_t.$$

Creating a vector of MA($q$) returns, $r_1, \dots, r_T$, involves simulating the $T + q$ shocks and then
combining them into returns. Note that $q$ additional residuals are required to generate $r_1$.


Listing 8.5: C-FinancialSimulations/M/./Ch08/MAsim.m


function [r,e] = MAsim(T,mu,sigma,theta)
% MAsim.m -- version 2019-04-17
% simulation of MA(q) process
q = length(theta);
e = randn(T+q,1) * sigma;
r = zeros(T+q,1);
for t = q+(1:T)
    % r_t = mu + sum_l theta_l * e_{t-l} + e_t, following the MA(q) equation
    r(t) = mu + e(t-(1:q))' * theta(:) + e(t);
end
r(1:q) = [];
e(1:q) = [];
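Since an MA($q$) return is a fixed linear combination of current and lagged shocks, the loop can also be replaced by a call to MATLAB's filter function. The sketch below compares both routes on the same shocks; the parameter values are illustrative assumptions.

```matlab
% Sketch: MA(q) returns from the same shocks, by loop and by filter.
% Parameter values are illustrative assumptions.
T = 1000; mu = 0.001; sigma = 0.01; theta = [0.5; 0.3];
q = length(theta);
e = randn(T+q,1) * sigma;          % T+q shocks, as in the text

% explicit loop, following the MA(q) equation
rLoop = zeros(T+q,1);
for t = q+(1:T)
    rLoop(t) = mu + e(t-(1:q))' * theta + e(t);
end
rLoop(1:q) = [];

% same result with filter: coefficients [1; theta] applied to the shocks
rFilt = mu + filter([1; theta], 1, e);
rFilt(1:q) = [];                   % first q values lack a full history

maxDiff = max(abs(rLoop - rFilt));
fprintf('largest difference between loop and filter: %g\n', maxDiff);
```

The loop stays closer to the equation, whereas the filter version is shorter and usually faster for long series.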


Some important properties of this model are as follows: 


• The expected value of $r_t$ is $\mu$ since all the $e_{t-\ell}$s have expected value 0. In the long run, the
short-term shifts of averages should vanish since positive and negative effects will balance
each other out (remember that $E(e_t) = 0$).

• If the residuals are n.i.i.d. with variance $\sigma_e^2$, then the variance of $r_t$ will be


$$\sigma_r^2 = \sigma_e^2 + \theta_1^2\sigma_e^2 + \theta_2^2\sigma_e^2 + \cdots + \theta_q^2\sigma_e^2 = \Big(1 + \sum_{\ell=1}^{q}\theta_\ell^2\Big)\sigma_e^2.$$


This implies that the returns will have higher volatility than the individual shocks. Intuitively,
this is because any return is influenced by the current innovation as well as the previous ones.

• The returns are autocorrelated. Any shock will have a linear impact on the current and the
subsequent $q$ observations. Hence, the covariance between current and lagged returns will be,
for $1 \le b \le q$ and with $\theta_0 = 1$,


$$\mathrm{Cov}(r_t, r_{t-b}) = (\theta_0\theta_b + \theta_1\theta_{b+1} + \cdots + \theta_{q-b}\theta_q)\,\sigma_e^2.$$

Obviously, current returns are only correlated to returns that are at most $q$ periods old; any
returns that are farther apart will be uncorrelated.
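These properties can be checked against a long simulated series. The following sketch does so for an MA(2) process; theta, sigma_e, the sample size and the seed are illustrative assumptions.

```matlab
% Sketch: empirical check of MA(2) variance and autocovariances.
% theta, sigma_e and T are illustrative assumptions.
randn('state', 3);
T = 200000; sigma_e = 0.01; theta = [0.5; 0.3]; q = length(theta);

e = randn(T+q,1) * sigma_e;
r = filter([1; theta], 1, e);      % MA(2) with mu = 0
r(1:q) = [];

th = [1; theta];                   % theta_0 = 1 prepended
% theoretical autocovariance at lag b: sum_l theta_l*theta_{l+b} * sigma_e^2
covTheo = @(b) sum(th(1:end-b) .* th(1+b:end)) * sigma_e^2;
% sample autocovariance at lag b
covEmp  = @(b) mean((r(1+b:end) - mean(r)) .* (r(1:end-b) - mean(r)));

for b = 0:3
    if b <= q, ct = covTheo(b); else, ct = 0; end
    fprintf('lag %d: theory %10.3e | sample %10.3e\n', b, ct, covEmp(b));
end
```

Lag 0 reproduces the variance formula, and the sample autocovariance beyond lag $q$ should differ from zero only by sampling noise.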


8.4.3 Autoregressive models 


Moving average models exhibit autocorrelation. However, current values do not depend on shocks
older than $q$, and when there is long memory, then this would require many $\theta_\ell$s to be calibrated.
Alternatively, one could simplify matters by stating an autoregressive (AR) model where the current
return is linked directly to the previous one (which, after all, contains $e_{t-1}$):


$$r_t = \mu + \phi_1 r_{t-1} + e_t.$$


More generally, if the previous $p$ returns are directly affecting the current one, then this is denoted
as AR($p$) process,


$$r_t = \mu + \phi_1 r_{t-1} + \cdots + \phi_p r_{t-p} + e_t = \mu + \sum_{\ell=1}^{p} \phi_\ell r_{t-\ell} + e_t.$$


To see the relationship between AR and MA models, take the simple case of an AR(1) model.
Repeated substitution of $r_{t-\ell}$ yields


$$\begin{aligned}
r_t &= \mu + \phi_1 r_{t-1} + e_t \qquad (8.2)\\
    &= \mu + \phi_1 \underbrace{(\mu + \phi_1 r_{t-2} + e_{t-1})}_{r_{t-1}} + e_t \\
    &= \mu + \phi_1 \big(\mu + \phi_1 \underbrace{(\mu + \phi_1 r_{t-3} + e_{t-2})}_{r_{t-2}} + e_{t-1}\big) + e_t \\
    &= \sum_{\ell=0}^{\infty} \mu\,\phi_1^{\ell} + \sum_{\ell=0}^{\infty} \phi_1^{\ell}\, e_{t-\ell}.
\end{aligned}$$
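The substitution can be verified numerically: after $T$ substitutions, the AR(1) recursion and the weighted sum of drifts and shocks must agree exactly, with the start value entering only through the weight $\phi_1^T$. The sketch below uses illustrative parameter values and an assumed start value $r_0$.

```matlab
% Sketch: AR(1) recursion versus its expansion into past shocks.
% mu, phi_1, sigma and the start value r_0 are illustrative assumptions.
T = 200; mu = 0.001; phi1 = 0.6; sigma = 0.01;
e  = randn(T,1) * sigma;
r0 = mu/(1 - phi1);                % start at the unconditional mean

% recursion r_t = mu + phi1*r_{t-1} + e_t
r = zeros(T,1); rPrev = r0;
for t = 1:T
    r(t)  = mu + phi1*rPrev + e(t);
    rPrev = r(t);
end

% expansion after T substitutions:
% r_T = sum_{l=0}^{T-1} phi1^l*(mu + e_{T-l}) + phi1^T * r_0
w      = (phi1.^(0:T-1))';         % weights phi1^0, ..., phi1^(T-1)
rT_exp = sum(w .* (mu + e(T:-1:1))) + phi1^T * r0;

fprintf('recursion %12.8f | expansion %12.8f | weight on r_0: %g\n', ...
        r(T), rT_exp, phi1^T);
```

For $|\phi_1| < 1$ the weight on the start value vanishes quickly, which is why the infinite sums in the last line of the derivation converge.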


This exercise can be performed for any AR($p$) model with more than 1 lag; the equations become
slightly more messy, but the basic result is the same: (weighted) drifts are piled up, and so are all
of the previous residuals. This has some substantial consequences:

• The former implies that the unconditional expected return of $r_t$ will not be $\mu$, but


$$E(r_t) = \mu/(1 - \phi_1 - \phi_2 - \cdots - \phi_p).$$


Note that for reasonable processes, the $\phi_\ell$s should add up to some nonnegative number
that is smaller than one.

• As in the MA model, any current and previous returns will be linked. There are some fundamental
differences, however. In the MA($q$) model, no residual older than $q$ periods is affecting
today’s return. In the AR model, all previous residuals are affecting the current return, either
directly or indirectly. Consider the AR(1) model from Eq. (8.2). Today’s residual, $e_t$, is directly
affecting today’s return. The residual from the day before, $e_{t-1}$, does not explicitly show up in
this equation, but it has an indirect effect because it is contained in $r_{t-1}$. And because $r_{t-1}$
contains $r_{t-2}$, $e_{t-2}$ will also have an indirect effect. Consequently, any AR($p$) model can be viewed
as a special case of an MA($\infty$) model. Admittedly, the echo of older observations will be
increasingly faint (and, in practice, unnoticeable), but if the shock in the price is large enough, it
will have a very long impact.

• As in MA models, the returns will also have higher volatility than the individual shocks: every
return contains many (weighted) individual residuals, and the (unconditional) variance will be
increased. Therefore, the variance of an AR(1) process, for example, is $\sigma_r^2 = \sigma_e^2/(1 - \phi_1^2)$.

• For the same reason, returns will also be autocorrelated. Again considering the AR(1) model,
the autocorrelation at lag $b$ is $\phi_1^b$. Because of the indirect effects described previously, shocks
older than the maximum lag in the AR model can have an impact.
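For the AR(1) case, the three closed-form results (mean, variance, autocorrelation) can be compared with sample statistics from a long simulation. The sketch below uses MATLAB's filter function for the recursion; parameter values, burn-in length and seed are illustrative assumptions.

```matlab
% Sketch: unconditional moments of a simulated AR(1) process.
% Parameter values are illustrative assumptions.
randn('state', 4);
T = 200000; Tb = 1000;             % sample size and burn-in
mu = 0.001; phi1 = 0.5; sigma_e = 0.01;

e = randn(T+Tb,1) * sigma_e;
% filter implements r_t = phi1*r_{t-1} + (mu + e_t), started at zero
r = filter(1, [1 -phi1], mu + e);
r(1:Tb) = [];                      % discard burn-in

rd   = r - mean(r);
acf  = @(b) sum(rd(1+b:end) .* rd(1:end-b)) / sum(rd.^2);

fprintf('mean     : %9.6f (theory %9.6f)\n', mean(r), mu/(1-phi1));
fprintf('variance : %9.3e (theory %9.3e)\n', var(r), sigma_e^2/(1-phi1^2));
for b = 1:3
    fprintf('acf(%d)   : %7.4f (theory %7.4f)\n', b, acf(b), phi1^b);
end
```

Unlike in the MA case, the autocorrelations do not cut off after a fixed lag but decay geometrically.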


Due to the longer memory of the process, the system should be allowed to burn in so that the first
reported sample already has some history. The following MATLAB code therefore produces twice the
number of samples necessary, but it discards the first half and reports only the second half.


Listing 8.6: C-FinancialSimulations/M/./Ch08/ARsim.m


function [r,e] = ARsim(T,mu,sigma,phi)
% ARsim.m -- version 2019-04-17
% simulation of AR(p) process

p = length(phi);
e = randn(2*T+p,1) * sigma;
r = ones(2*T+p,1) * mu / (1-sum(phi(:)));
for t = p+(1:(2*T))
    r(t) = mu + r(t-(1:p))' * phi(:) + e(t);
end
r(1:(p+T)) = [];
e(1:(p+T)) = [];


8.4.4 Autoregressive moving average (ARMA) models 


Moving average models capture the fact that returns depend not only on current information, but 
also on signals that have arrived over a previous stretch of time. This could happen if new infor- 
mation is only gradually absorbed or reaches market participants at different points in time. As a 
consequence, any new signal has not only an immediate, but also a delayed, effect. 

Autoregressive models assume that there is a linear relationship between current returns and
their own history. This type of model can be used when (some) investors base their decisions on
recent price movements: in a bull market, profits attract more buyers who will drive up the price
even further; and falling prices are seen as a sell signal that will prolong the downward movement.

These two concepts can be combined; not surprisingly, the resulting model is then called
autoregressive moving average model, ARMA($p$, $q$):


$$r_t = \mu + \sum_{\ell=1}^{p}\phi_\ell r_{t-\ell} + \sum_{\ell=1}^{q}\theta_\ell e_{t-\ell} + e_t. \qquad (8.3)$$


Obviously, this model nests the two individual ones: setting $p = 0$ (or $\phi_\ell = 0\ \forall \ell$) reduces it to the
moving average model MA($q$), whereas $q = 0$ (or $\theta_\ell = 0\ \forall \ell$) blanks out the MA part and leaves an
AR($p$) model.
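The nesting can be illustrated with a compact sketch in which Eq. (8.3) is applied in two filter steps: first the MA part to the shocks, then the AR recursion. The helper armaSketch is hypothetical; it assumes zero pre-sample values and no burn-in, and all parameter values are illustrative.

```matlab
% Sketch: ARMA(p,q) via two filter steps, and the nested MA and AR cases.
% armaSketch is a hypothetical helper (zero pre-sample values, no burn-in).
armaSketch = @(mu,phi,theta,e) ...
    filter(1, [1; -phi(:)], mu + filter([1; theta(:)], 1, e));

T = 500; mu = 0.001; sigma = 0.01;
phi = [0.4; 0.2]; theta = 0.3;
e = randn(T,1) * sigma;

rARMA = armaSketch(mu, phi, theta, e);   % full ARMA(2,1)
rMA   = armaSketch(mu, [],  theta, e);   % p = 0: MA(1)
rAR   = armaSketch(mu, phi, [],    e);   % q = 0: AR(2)

% direct formulas for the nested cases, for comparison
rMAdirect = mu + e + theta*[0; e(1:end-1)];
rARdirect = zeros(T,1);
for t = 1:T
    lagged = [0; 0];                           % pre-sample returns set to zero
    if t > 1, lagged(1) = rARdirect(t-1); end
    if t > 2, lagged(2) = rARdirect(t-2); end
    rARdirect(t) = mu + phi' * lagged + e(t);
end

fprintf('MA check: %g | AR check: %g\n', ...
        max(abs(rMA - rMAdirect)), max(abs(rAR - rARdirect)));
```

Passing an empty coefficient vector switches the corresponding part off, which mirrors setting $p = 0$ or $q = 0$ in the model.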

In econometric analysis, applying the ARMA model typically starts with the identification, that
is, determining $p$ and $q$. The estimation of the parameters can be done by minimizing the sum of
squared residuals, by maximizing the likelihood, or by using an information criterion. The latter considers
not only the residual sum of squares but also the number of parameters used in the model and the
number of available observations; by doing so, it can guide identification. Two popular choices are
Akaike’s information criterion (AIC; Akaike, 1974),


$$\mathrm{AIC} = -2L + 2k,$$


and Schwarz’s Bayesian information criterion (SBC or BIC; Schwarz, 1978),

$$\mathrm{BIC} = -2L + \ln(T)\,k,$$


where $k = 1 + p + q$ is the number of parameters included in the model and $L$ is the loglikelihood.
The optimal model is found by minimizing any of these information criteria. Both criteria have their
respective merits and disadvantages: AIC is not consistent but usually efficient and tends to favor
larger models, whereas BIC is consistent but inefficient.⁸
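As a small worked example, the sketch below compares a constant-only model with an AR(1) model on simulated AR(1) data. Both models are fitted by least squares, and the Gaussian loglikelihood is evaluated at the variance estimate RSS/$T$; the data-generating values and the seed are illustrative assumptions.

```matlab
% Sketch: AIC and BIC for a constant-only model versus an AR(1) model,
% both fitted by least squares with Gaussian log-likelihood.
% Data-generating values are illustrative assumptions.
randn('state', 5);
T = 2000; mu = 0.001; phi1 = 0.5; sigma = 0.01;
r = filter(1, [1 -phi1], mu + sigma*randn(T+1,1));

y = r(2:end);                      % both models explain the same T values
% Gaussian log-likelihood at the ML variance estimate RSS/T
logLik = @(res) -T/2 * (log(2*pi) + log(sum(res.^2)/T) + 1);

% model 0: constant only (p = q = 0, so k = 1)
res0 = y - mean(y);
% model 1: AR(1) with constant (p = 1, q = 0, so k = 2)
Z    = [ones(T,1) r(1:end-1)];
res1 = y - Z*(Z\y);

k   = [1 2];
L   = [logLik(res0) logLik(res1)];
AIC = -2*L + 2*k;
BIC = -2*L + log(T)*k;

fprintf('            const-only      AR(1)\n');
fprintf('AIC      %12.2f %12.2f\n', AIC);
fprintf('BIC      %12.2f %12.2f\n', BIC);
```

With a true autoregressive coefficient this far from zero, the gain in loglikelihood easily outweighs the penalty for the extra parameter under both criteria.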


8.4.5 Simulating ARMA models 


To simulate an ARMA process, one can follow the reverse engineering procedure introduced
in Section 8.3: first, generate the independent noisy bits, then modify them according to the model
specification, and combine them with all dependence structures from the model.


1. Generate the n.i.i.d. residuals $\{e_t\}_{t=1}^{T}$ with standard deviation $\sigma_e$.

2. To get the returns, add up the drift, the current residuals, the weighted lagged residuals, and the 
weighted lagged returns. 

3. Convert the return series into a price series by adding the returns (which, by definition, are the 
differences of the log prices) and take the exponentials. 


For the MATLAB implementation, one has to bear in mind that only positive indices are allowed. 
Also, because the AR process assumes the existence of previous returns, it is not uncommon to 
initialize the system, where necessary, with the unconditional expected values and generate more 
residuals and longer time series than required. The idea is to allow the system to burn in but only 
use the last T observations. 


Listing 8.7: C-FinancialSimulations/M/./Ch08/ARMAsim.m


function [S,r,e] = ARMAsim(mu,phi,theta,sigma,T,Tb)
% ARMAsim.m -- version 2011-01-06

% -- prepare parameters
if nargin < 6, Tb = 0; end
phi = phi(:); theta = theta(:);
p = length(phi); q = length(theta);
T_compl = T + Tb + max(p,q);
lag_p = 1:p;
lag_q = 1:q;

% -- initialise vectors
r = ones(T_compl,1) * mu;
e = randn(T_compl,1) * sigma;

% -- simulate returns
for t = (max(p,q)+1) : T_compl
    r(t) = mu + r(t-lag_p)' * phi + e(t-lag_q)' * theta + e(t);
end

% -- discard initial values & compute prices
e(1:(T_compl-T)) = [];
r(1:(T_compl-T)) = [];
S = exp(cumsum(r));


8. For more details, see Greene (2008) and Hastie et al. (2013). A relevant special case is that of normally distributed residuals.
The AIC then comes in different versions, including $\mathrm{AIC} = T\ln(\mathrm{RSS}) + 2k + c$ or $\ln(\mathrm{RSS}) + 2k/T + c/T$, where
$\mathrm{RSS} = \sum_t e_t^2$, and $\mathrm{AIC} = T\ln(\hat\sigma^2) + 2k + c'$ or $\ln(\hat\sigma^2) + 2k/T + c'/T$, where $\hat\sigma^2 = \mathrm{RSS}/T$; $c$ and $c'$ are
constants (and are sometimes omitted); see also Section 16.3.2. AIC and BIC are used for model comparison and model
selection: an additional parameter increases $k$, which has to be outweighed by the decrease in RSS (or $\hat\sigma^2$, or increase in $L$)
to have an overall benefit. Because the alternative versions differ either by a constant or are scaled by a constant, the ranking
of different models will not be affected, but the numerical values will differ. That makes it difficult to compare results from
different packages.


Experimenting with different parameters for $\phi_\ell$ and $\theta_\ell$ can help to get a feeling for the behavior
of ARMA processes. Trying rather extreme values can underline the particularities. Fig. 8.5 provides
the returns and corresponding prices for simulations in the case where $e_t \sim N(0, 0.01^2)$ and
$\mu = 0$, but different values for $\phi_1$ and $\theta_1$. Note that the seed was fixed for each situation, so all
variants use the same series of $e_t$s:


randn('state',100);
[S,r,e] = ARMAsim(mu,phi,theta,sigma,T,Tb);


The graph in the center represents a plain vanilla random walk with (geometric) Brownian motion
since $\phi_1 = \theta_1 = 0$; the center column contains only MA models, whereas the center row contains
AR models. A large negative $\phi_1$ (left columns) increases negative autocorrelation in the returns, and
returns and prices will oscillate. A large positive $\phi_1$ (rightmost column) will smooth the price process.
$\theta_1$ has similar effects: negative values encourage oscillation, whereas positive values produce
trends and smoother behavior. When $\theta_1$ has the same sign as $\phi_1$, these effects will be enhanced;
opposite signs reduce the effects. Note, however, that $\phi_1$ has a much stronger impact. Also, note how
volatility clustering can emerge, in particular, when both parameters have equal signs: when negative
(top left), the magnitude of (absolute) returns seems to build up and slowly decrease again.
Likewise, when both parameters are positive (bottom right), returns move gradually away from
the mean (i.e., become more extreme) and only slowly move back to mean values (also, note different
scales of the y-axis for some prices). In the former case, the signs of the returns change
excessively often, whereas in the latter, sign changes are much less frequent than in the GBM case
(center).
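The remark on sign changes can be quantified with a small experiment: for pure AR(1) returns driven by the same shocks, count how often consecutive returns differ in sign. The values of $\phi_1$, the sample size and sigma below are illustrative assumptions and are not meant to reproduce the settings of Fig. 8.5.

```matlab
% Sketch: share of sign changes in AR(1) returns for different phi_1,
% using the same shocks each time (mu = 0; values are illustrative).
randn('state', 100);
T = 20000; sigma = 0.01;
e = randn(T,1) * sigma;

phis   = [-0.8 0 0.8];
shares = zeros(size(phis));
for i = 1:length(phis)
    r = filter(1, [1 -phis(i)], e);                  % AR(1) returns
    shares(i) = mean(sign(r(2:end)) ~= sign(r(1:end-1)));
end

fprintf('phi_1 = %4.1f: share of sign changes %5.3f\n', [phis; shares]);
```

For uncorrelated returns the sign changes about every second period; negative autocorrelation pushes this share up, positive autocorrelation pushes it down.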

The values chosen for $\phi_1$ and $\theta_1$ are rather unusual for real-world time series such as stock
returns, in particular the negative ones. However, they illustrate well the resulting effects. Large
values for $\phi_1$ and/or $\theta_1$ produce high autocorrelation, meaning that future returns would be strongly
determined by what can already be seen in the market, making profits and losses predictable. And 
this is clearly not what one can see in efficient markets. Nonetheless, significant ARMA parameters 
can often be found in empirical data, in particular in interest rates. For Monte Carlo simulations 
that are used for pricing or stress testing, the usual approach would be a data-driven calibration, in 
which the parameters are fitted to existing historical time series and then (potentially with variants) 
used for simulations. 


8.4.6 Models with long-term memory 


FIGURE 8.5 Returns and prices for different AR, MA, and ARMA models.


A crucial assumption in financial econometrics is stationarity: in a stationary system, any shock
will die away eventually; in nonstationary systems, it will persist. Unlike returns, prices are usually
nonstationary:


$$S_t = S_{t-1}\exp(r_t) = \big(S_{t-2}\exp(r_{t-1})\big)\exp(r_t) = \cdots = S_0 \prod_{\tau=1}^{t}\exp(r_\tau) = S_0 \exp\Big(\sum_{\tau=1}^{t} r_\tau\Big)$$


or, even more obviously, when the log prices are considered,

$$s_t = s_0 + r_1 + r_2 + \cdots + r_t.$$


The impact of $r_1$ on the price at $t$ might be superposed by the subsequent returns; nonetheless, it is
still there. When mean returns are positive, stock prices are, therefore, expected to grow, too, and so
will their mean over time. Stock prices then follow a nonstationary process, and the random walk
model with a drift in Eq. (8.4) is a very simple example of that:

$$\begin{aligned}
s_t &= s_{t-1} + \underbrace{\mu + e_t}_{=\,r_t} \qquad (8.4)\\
    &= s_0 + \underbrace{\mu + e_1}_{=\,r_1} + \underbrace{\mu + e_2}_{=\,r_2} + \cdots + \underbrace{\mu + e_t}_{=\,r_t} \\
    &= s_0 + t\mu + \sum_{\tau=1}^{t} e_\tau. \qquad (8.5)
\end{aligned}$$

Eq. (8.5) can be made stationary by detrending it; this case is called trend-stationary.
The random walk with drift (Eq. (8.4)) can be generalized to

$$s_t = \phi_1 s_{t-1} + \mu + e_t, \qquad (8.6)$$


which is just another case of an AR(1) model. If $\phi_1 = 1$, then this process has a unit root; $|\phi_1| > 1$
makes it explosive, whereas $0 < |\phi_1| < 1$ makes it stationary.⁹ In passing, note that $\phi_1 < 0$ should
actually never occur for any real-world prices. Further note that the use of nonstationary data can
lead to spurious correlations and regressions.
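The three regimes differ in what happens to a single shock over time. The sketch below feeds one unit shock into Eq. (8.6) with $\mu = 0$ and tracks its remaining effect on $s_t$; the three values of $\phi_1$ and the horizon are illustrative assumptions.

```matlab
% Sketch: effect of a single unit shock on s_t in Eq. (8.6) for
% stationary, unit-root, and explosive phi_1 (mu = 0; values illustrative).
H = 50;                            % horizon in periods
impulse = [1; zeros(H,1)];         % one shock at t = 0, none afterwards

phis = [0.9 1 1.05];
resp = zeros(H+1, length(phis));
for i = 1:length(phis)
    % s_t = phi_1*s_{t-1} + e_t, so the response after h periods is phi_1^h
    resp(:,i) = filter(1, [1 -phis(i)], impulse);
end

fprintf('phi_1 = %4.2f: remaining effect after %d periods = %8.4f\n', ...
        [phis; H*ones(1,3); resp(end,:)]);
```

In the stationary case the shock dies away, with a unit root it persists at full size, and in the explosive case it grows without bound.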


8.5 Time-varying volatility 


8.5.1 The concepts 


Basic statistics teaches us that a normally distributed variable, $X$, can be converted into a standard
normal one, $Z$, by first centering it by subtracting the mean and then scaling it:


$$X \sim N(\mu, \sigma^2) \;\Rightarrow\; Z = \frac{X - \mu}{\sigma} \sim N(0, 1). \qquad (8.7)$$

This relationship holds both ways: given a standard normal variable $Z$, rearranging Eq. (8.7) tells
us how to convert it into one with prespecified mean, $\mu$, and variance, $\sigma^2$. In fact, the relationship


$$X = \mu + Z\sigma \quad \text{where } Z \sim N(0, 1) \text{ and } X \sim N(\mu, \sigma^2) \qquad (8.8)$$


was already used for the generation of returns in Section 8.3. The assumption then was that all 
returns are n.i.i.d. Introducing moving average and autoregressive models still left the residuals 
n.i.i.d., though the returns were not. 

In real financial time series, we often find volatility clustering: large residuals are followed by 
large residuals, and small residuals by small ones. In other words, one can distinguish phases of 
large and small volatility, and the idea of homoscedasticity (equal variance) should be replaced with 
heteroscedasticity (different variances). To capture this in our models, all one needs to do is add a 
time index to the variance parameter, and then even Eq. (8.7) still holds: 


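$$x_t \sim N(\mu, \sigma_t^2) \;\Rightarrow\; z_t = \frac{x_t - \mu}{\sigma_t} \sim N(0, 1). \qquad (8.7^*)$$
And so does relationship (8.8):
$$x_t = \mu + z_t\sigma_t \quad\text{where } z_t \sim N(0, 1) \text{ and } x_t \sim N(\mu, \sigma_t^2). \qquad (8.8^*)$$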


This rather inconspicuous extension is actually an extremely powerful approach. Variance, by 
definition, is the expected squared deviation from the mean. Large variance implies that large de- 
viations are likely, small variance that observations will be close to their expected value. If the 


9. The classical test for a unit root is provided by Dickey and Fuller (1979); for more details on this, econometrics textbooks 
such as Greene (2008) are recommended.

FIGURE 8.6 Scaled white noise; dark lines indicate $\pm\sigma_t$.


magnitude of these deviations varies with time, then this can also be captured by a process that is 
combined with white noise. 

To demonstrate this, let’s assume some extreme examples, which, admittedly, are not typically
seen in financial time series. If standard deviation grows linearly over time, then $\sigma_t = \beta t$. On the
other hand, $\sigma_t = \sin(t/\alpha)$ describes a cyclical pattern in volatility. Fig. 8.6 depicts these relationships: the top panel shows white noise, $\{z_t\} \sim N(0, 1)$; the center and bottom panels show scaled
versions of it. The dark gray lines indicate one standard deviation above and below the expected
value, which remains invariably at zero. The noisy ingredient is normally distributed, so roughly
two-thirds of the observations should be within the $\pm\sigma_t$ confidence interval.¹⁰ This is true regardless
of scaling: consider the first panel to be printed on a rubber canvas that, in the other panels, is
stretched and squeezed vertically. The levels change, but not the relative positions to the respective
standard deviations.
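
The rubber-canvas argument can be checked numerically. The sketch below scales one white-noise sample with a linearly growing $\sigma_t$; the slope beta and the sample size are assumed values, not taken from the figure.

```matlab
% Sketch: scaling white noise changes levels, not positions relative to sigma_t.
randn('seed', 42);
T     = 10000;
z     = randn(T,1);               % white noise, z_t ~ N(0,1)
t     = (1:T)';
beta  = 0.01;                     % slope of the linear volatility (assumed value)
sigma = beta * t;                 % sigma_t = beta * t
x     = z .* sigma;               % scaled noise, x_t ~ N(0, sigma_t^2)
share_z = mean(abs(z) <= 1);      % share of z_t within one standard deviation
share_x = mean(abs(x) <= sigma);  % share of x_t within +/- sigma_t
fprintf('%.4f  %.4f\n', share_z, share_x);
```

Both shares coincide, and for a sample of this size they should be close to the 68% mentioned in the footnote.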


8.5.2 Autocorrelated time-varying volatility 


To bring this idea closer to financial problems, let’s try to model some stylized facts: in daily returns,
high (low) volatility tends to be followed by rather high (low) volatility. Volatility is the square root
of variance; and variance, as mentioned, is the expected squared deviation from the mean, that is,
the expected squared residual, $\sigma^2 = E(e^2)$. The current variance could, therefore, be estimated as a
weighted sum of recent realizations of squared residuals:


$$r_t = \mu + e_t \quad\text{where } e_t = z_t\sigma_t \sim N(0, \sigma_t^2) \text{ and } z_t \sim N(0, 1),$$
$$\sigma_t^2 = \alpha_0 + \sum_{\ell=1}^{q} \alpha_\ell e_{t-\ell}^2. \qquad (8.9)$$


This approach was suggested by Engle (1982) under the name autoregressive conditional heteroscedasticity (ARCH($q$)) model. In this model, the new estimate for variance, $\sigma_t^2$, is reacting
to its most recent realization, $e_{t-1}^2$. If memory is longer, then an ARMA model for the variances


10. For a sufficiently large normally distributed sample, approximately 68% of the observations should be within one 
standard deviation; 95.5% within two standard deviations; and 99.7% within three standard deviations around the mean.

FIGURE 8.7 Daily returns (gray lines) and $\mu \pm 2\sigma_t$ confidence bands (dark lines) (left panel) for different combinations
of $\alpha_1 + \beta_1 = 0.95$ and prices (right panel) for GARCH(1, 1) processes.


might be suitable:
$$\sigma_t^2 = \alpha_0 + \alpha_1 e_{t-1}^2 + \beta_1\sigma_{t-1}^2.$$


A bit of calculus shows that the inclusion of the lagged variance on the right-hand side of the
equation introduces the previous error terms: $\sigma_{t-1}^2$ contains both $e_{t-2}^2$ and $\sigma_{t-2}^2$; $\sigma_{t-2}^2$ contains the
values from $t - 3$; and so on. It can, therefore, be seen as a generalized version of the ARCH
model Eq. (8.9), and in addition to $q$ lagged squared residuals, one could include $p$ lagged variance
estimates. Not surprisingly, the inventor of this extended model, therefore, called it Generalized
ARCH (GARCH($p$, $q$)) model (see Bollerslev, 1986):
$$\sigma_t^2 = \alpha_0 + \sum_{\ell=1}^{q} \alpha_\ell e_{t-\ell}^2 + \sum_{\ell=1}^{p} \beta_\ell\sigma_{t-\ell}^2. \qquad (8.10)$$
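
For the GARCH(1, 1) case, the recursive substitution described above can be written out explicitly. The closed form in the last line is a standard result that assumes $0 \le \beta_1 < 1$; it is added here as an illustration and is not taken from the text.

```latex
\begin{aligned}
\sigma_t^2 &= \alpha_0 + \alpha_1 e_{t-1}^2 + \beta_1 \sigma_{t-1}^2 \\
           &= \alpha_0 + \alpha_1 e_{t-1}^2
              + \beta_1\left(\alpha_0 + \alpha_1 e_{t-2}^2 + \beta_1 \sigma_{t-2}^2\right) \\
           &= \alpha_0 (1 + \beta_1)
              + \alpha_1\left(e_{t-1}^2 + \beta_1 e_{t-2}^2\right)
              + \beta_1^2 \sigma_{t-2}^2 \\
           &= \cdots
            = \frac{\alpha_0}{1 - \beta_1}
              + \alpha_1 \sum_{k=1}^{\infty} \beta_1^{\,k-1} e_{t-k}^2 .
\end{aligned}
```

In this form, a GARCH(1, 1) is an ARCH model of infinite order whose weights on past squared residuals decay geometrically at rate $\beta_1$.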


In either case, the parameters for Eqs. (8.9) and (8.10) are estimated by maximizing the log-likelihood function¹¹
$$L = -\frac{T}{2}\log(2\pi) - \frac{1}{2}\sum_{t=1}^{T}\left(\log(\sigma_t^2) + \frac{e_t^2}{\sigma_t^2}\right).$$
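
To make the objective function concrete, the following sketch evaluates this log-likelihood for a GARCH(1, 1) at one candidate parameter vector. The residuals are simulated placeholders, the parameter values are assumptions, and starting the recursion at the sample variance is only one common choice.

```matlab
% Sketch: Gaussian log-likelihood of a GARCH(1,1) for given parameters.
randn('seed', 42);
T  = 1000;
e  = 0.01 * randn(T,1);           % placeholder residuals (assumed); in practice e = r - mu
a0 = 1e-5; a1 = 0.10; b1 = 0.80;  % candidate parameters (assumed values)
s2    = zeros(T,1);               % conditional variances sigma_t^2
s2(1) = var(e);                   % starting value: sample variance (one common choice)
for t = 2:T
    s2(t) = a0 + a1 * e(t-1)^2 + b1 * s2(t-1);
end
L = -T/2 * log(2*pi) - 0.5 * sum(log(s2) + e.^2 ./ s2);
disp(L)
```

An optimizer would wrap these lines in a function of (a0, a1, b1) and search for the maximum of L, subject to the parameter restrictions discussed next.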


Similar to other ARMA processes, certain restrictions and properties apply to the parameter values:
First, the sum of all $\alpha_\ell$s and $\beta_\ell$s must stay below 1; otherwise shocks to the variance no longer
die out, and the unconditional variance is not finite. Second, all $\alpha$s and $\beta$s should be nonnegative
(with $\alpha_0 > 0$); otherwise the estimate for variance could become negative, and the volatility an
imaginary number. Third, $z_t = e_t/\sigma_t$ is normally distributed, but $e_t$ is not. In fact, for a GARCH(1, 1)
it will have a kurtosis of
$$\frac{3\left(1 - (\alpha_1 + \beta_1)^2\right)}{1 - (\alpha_1 + \beta_1)^2 - 2\alpha_1^2} > 3,$$
provided the denominator is positive.
Its unconditional variance, by the way, is $\alpha_0 / \left(1 - \sum_{\ell=1}^{q} \alpha_\ell - \sum_{\ell=1}^{p} \beta_\ell\right)$.
The $\alpha_\ell$s govern the short-term impacts of shocks, the $\beta_\ell$s model longer persistence. To see the
effects of different parameter values, compare the plots in Fig. 8.7. To make the effects of different


11. Maximizing this log-likelihood function is not as smooth and well behaved as often assumed; the use of heuristics
is therefore highly recommended. Maringer (2005b) and Winker and Maringer (2009) provide more on this, as does Section 16.3.2.

parameter settings more comparable, all processes use the same simulated sequence of $z_t$s; also,
they all have the same unconditional variance. The higher $\alpha_1$, the more a shock on day $t$ will drive
up next day’s volatility. However, a lower $\beta_1$ will make this shock short lived. Alternatively, if
$\beta_1$ is high (and $\alpha_1$ is low), it needs some rather large shocks to noticeably affect the next day’s
volatility, but if it does, it takes longer to decay. Hence, larger $\alpha$s make the volatility process more
spiky and ragged, favoring rapid changes in the volatility; more emphasis on the $\beta$s, on the other
hand, smoothes the volatility process and produces longer swings. The extreme case here would be
$\alpha_1 = 0$, where innovations never enter the volatility process and variance becomes constant over
time (top panel).

Keeping $\beta_1$ fixed while varying $\alpha_1$ or vice versa will have different effects. The reader is encouraged to use the MATLAB code provided below to perform their own experiments. Also, be sure
that none of the underlying assumptions is violated ($\alpha_1 + \beta_1 < 1$, etc.), otherwise you might be in
for a surprise.

For stock returns, $\alpha_1 + \beta_1$ is typically above 0.9. Also, the main emphasis is usually on the long-term
GARCH parameter, $\beta_1$, with typical values of 0.8 and above. In times of massive turmoil in the
market, $\alpha_1$ can exceed 0.1, but more often than not, it is below that value. In practical applications,
however, one often finds that the parameters are not stable over time. This is not unusual for any
time-series model; and most of the time, the parameters vary within “reasonable” limits. Consider,
for example, the FTSE returns, which have seen some dramatic periods recently.¹² For the three
years spanning July 1, 2004 to June 30, 2007, the MATLAB toolbox¹³ fits the daily log returns with


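$$r_t = \underset{(0.00023^*)}{0.00065} + e_t, \qquad e_t \sim N(0, h_t),$$
$$h_t = \underset{(1.08\times10^{-6\,*})}{3.0\times10^{-6}} + \underset{(0.023^*)}{0.09}\, e_{t-1}^2 + \underset{(0.04^*)}{0.84}\, h_{t-1},$$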


whereas for the subsequent three years from July 1, 2007 to June 30, 2010, the estimates are 


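$$r_t = \underset{(0.00051)}{0.00037} + e_t, \qquad e_t \sim N(0, h_t),$$
$$h_t = \underset{(2.2\times10^{-6\,*})}{5.1\times10^{-6}} + \underset{(0.021^*)}{0.12}\, e_{t-1}^2 + \underset{(0.022^*)}{0.87}\, h_{t-1}.$$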


For the entire 6 years, however, the fitted model is 


$$r_t = \underset{(0.00021^*)}{0.00059} + e_t, \qquad e_t \sim N(0, h_t),$$
$$h_t = \underset{(32.18\times10^{-6\,*})}{91.1\times10^{-6}} + \underset{(0.013^*)}{0.11}\, e_{t-1}^2 + \underset{(0.012^*)}{0.89}\, h_{t-1}.$$


(Values underneath parameters are standard deviations; * indicates significance at the usual 5%.)
As can be seen, the parameters $\alpha_1$ and $\beta_1$ do not change dramatically.¹⁴

If the assumption of a GARCH process holds, the standardized innovations, $z_t = e_t/\sqrt{h_t}$, ought
to be standard normally distributed. Fig. 8.8 plots the actual returns and the standardized residuals
for the FTSE data set introduced above, as well as for S&P 500 returns. In either case, most of
the volatility clustering can be filtered out (bottom panel). Running a Kolmogorov-Smirnov test
shows that, for the FTSE data, the $z_t$s actually can be regarded as normally distributed (p value
0.076), whereas for the S&P 500, this must be rejected (p value 0.001). Neither series passes the
Jarque-Bera test for normality.
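
The standardization step and a normality check can be sketched without any toolbox. Below, e and h are simulated placeholders (with parameters close to the first FTSE fit quoted above); with real data, they would be the residuals and the fitted conditional variances. The Jarque-Bera statistic is computed by hand from sample skewness and kurtosis; under normality it is asymptotically $\chi^2$ distributed with two degrees of freedom, so the 5% critical value is about 5.99.

```matlab
% Sketch: standardize innovations and compute the Jarque-Bera statistic by hand.
randn('seed', 42);
T  = 2500;
a0 = 3.0e-6; a1 = 0.09; b1 = 0.84;     % close to the first FTSE fit quoted above
z0 = randn(T,1);
e  = zeros(T,1); h = zeros(T,1);
h(1) = a0/(1-(a1+b1)); e(1) = z0(1)*sqrt(h(1));
for t = 2:T                             % placeholder data: simulated GARCH(1,1)
    h(t) = a0 + a1*e(t-1)^2 + b1*h(t-1);
    e(t) = z0(t)*sqrt(h(t));
end
m      = @(x,k) mean((x - mean(x)).^k);                 % k-th central sample moment
jbstat = @(x) numel(x)/6 * ( (m(x,3)/m(x,2)^1.5)^2 + ...
                             (m(x,4)/m(x,2)^2 - 3)^2/4 );
z = e ./ sqrt(h);                       % standardized innovations
fprintf('JB raw: %.1f, JB standardized: %.1f (5%% critical value 5.99)\n', ...
        jbstat(e), jbstat(z));
```

Since the placeholder data are generated by the model itself, the standardized series reproduces the normal draws exactly; with empirical data, any remaining nonnormality shows up in the second statistic.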


12. The following results are based on adjusted prices as downloaded from finance.yahoo.com. The symbol for the FTSE
is ^FTSE, and the one for the S&P 500, used later, is ^GSPC.

13. In previous versions, the syntax was [Coeff, Errors, LLH, Innovations, Sigmas, Summary] =
garchfit(r). In more recent versions, the equivalent call is mdl = garch('GARCH',NaN,'ARCH',NaN,'offset',NaN); fit = estimate(mdl, r), and the variable fit contains all results as well as some helpful functions such as fit.infer(r) (to infer conditional variances of time series r with fitted parameters) and fit.forecast(n) (to forecast the conditional variance for the subsequent n observations).

14. Things look slightly different, though, when shorter time windows are used. Fitting the model to individual years 2000,
2001, ... , 2009 yields $\beta_1$s in the range of 0.75 to 0.95, and apart from a few exceptions, $\alpha_1 + \beta_1$ adds up to something in
the range of 0.95 to 1.

FIGURE 8.8 Actual daily log returns for the FTSE and S&P 500 for the period Jan 2000 to Dec 2009 and $\pm 2\sqrt{h_t}$ (top
panel), and innovations standardized using time-varying volatility, $\sqrt{h_t}$, as fitted with a GARCH(1, 1) (bottom panel).


Also, all the parameters in the variance process are significant, suggesting that they do capture 
autocorrelation patterns in the variance process. However, this does not mean that these models 
capture everything that’s going on in the data. To illustrate what is left out, again a simulation can 
help where first the model is fitted to an empirical data set, and then simulations are performed 
based on it. 


8.5.3 Simulating GARCH processes 


To generate a sample path that follows a GARCH process, one can again follow the “reverse engineering” principle: the main ingredient of a GARCH model is the variance process $h_t$; dividing
the residuals of the returns $e_t$ by its square root, $\sqrt{h_t}$, is equivalent to standardizing them, and
$z_t = e_t/\sqrt{h_t}$ should follow a standard normal distribution. The simulation works its way in the opposite direction: for any period $t$ along the sample path, one needs to compute the current variance
based on the previous variance and realized innovation, $h_t = \alpha_0 + \alpha_1 e_{t-1}^2 + \beta_1 h_{t-1}$, and draw a
sample, $z_t \sim N(0, 1)$. One then can compute the current innovation, $e_t = z_t\sqrt{h_t}$. These new values
for $h_t$ and $e_t$ can then be used to compute $h_{t+1}$, which, combined with a new sample $z_{t+1}$, gives the
next innovation, $e_{t+1}$, and so on. With the innovations ready, adding the drift $\mu$ provides the return
series $r_t = \mu + e_t$.

To get the ball rolling, however, initial values for the first period’s variance and innovation need 
to be provided. There are three approaches that are commonly used: 


1. Starting values are the unconditional statistics and values. For GARCH models, one can use the
unconditional variance for $h_1 = \alpha_0/(1 - (\alpha_1 + \beta_1))$ with which one can generate $e_1$.

2. The downside of this is that the first draws from repeated experiments will look rather similar;
this can be avoided, though. After initialization, a sequence of blank observations is generated
to allow the system to swing in. For example, in order to simulate a path of $T = 1000$ observations, one generates a total of $T_b + T$ periods, but reports only the last $T$ periods. The idea is
that over the first $T_b$ samples, the processes have diverged sufficiently. Imagine it as starting the
simulation in the past at $t = -(T_b + 1)$. This approach can also be used as an ad hoc solution in
the absence of reasonable starting solutions. Starting with arbitrary initial values and discarding
the early values again is often a good enough approach.

3. If one wants to generate future scenarios for an existing price process, one can use the current
values for the respective variables. For example, if one wants to simulate what could happen
over the next month, this approach would fit the parameters onto recent observations, estimate
the current variance, and use this value as the value for $h_1$; even better, when $e_0$ is already
available, it computes $h_1$ according to definition by using real data for $h_0$ and $e_0$.

The following MATLAB code produces a vector of returns of length $T$. By default, it starts with
the unconditional variance. If one wants the system to swing in first, one can provide as a sixth
argument the number of additional blank samples; these will not be returned. The sample price
process can then be computed via the relationship $S_t = S_{t-1}\exp(r_t)$.


Listing 8.8: C-FinancialSimulations/M/./Ch08/GARCHsim.m 


 1| function [r,e,h] = GARCHsim(mu,a0,a1,b1,T,Tb)
 2| % GARCHsim.m -- version 2011-01-06
 3| if nargin < 6, Tb = 0; end % no "before" periods to swing in
 4| % -- initialize variables
 5| z = randn(T+Tb,1);
 6| e = zeros(T+Tb,1);
 7| h = zeros(T+Tb,1);
 8| h(1) = a0/(1-(a1+b1));
 9| e(1) = z(1) * sqrt(h(1));
10| % -- generate sample variances and innovations
11| for t = 2:(T+Tb)
12|     h(t) = a0 + a1 * e(t-1)^2 + b1 * h(t-1);
13|     e(t) = z(t) * sqrt(h(t));
14| end
15| % -- remove excess observations from initialization phase
16| e(1:Tb,:) = [];
17| h(1:Tb,:) = [];
18| % -- compute returns
19| r = e + mu;


Continuing our example for the FTSE, let’s consider the previous results for the entire data set of
six years. Fig. 8.9 depicts the actual return process and two simulations. Both actual and simulated
returns show volatility clustering, yet at different times. By design, high volatility periods can build
up as a consequence of large shocks; since the trigger for such a period occurs randomly, one
cannot expect that real and simulated returns have their high-risk episodes synchronized. What is
more important, however, is that the simulated returns sometimes lack the extreme spikes. This
happens even more often when $\alpha_1$ is rather low; values of 0.05 can occur for real stock and
index prices, and simulations without excessively turbulent periods are then the rule rather than
the exception. Running some experiments can help illustrate this.

This underlines another important issue. The more sophisticated the model, the more diverse
the outcome from independent experiments can be. It is, therefore, of paramount importance to run
many replications in order to see the overall picture. For GARCH simulations, this diversity usually
shows in the presence or absence of fat tails; the (unconditional) kurtosis of the $e_t$s can (and usually
will) deviate from its theoretical value (stated here for the ARCH(1) case, $\beta_1 = 0$),
$$\frac{E(e_t^4)}{E(e_t^2)^2} = \frac{3(1-\alpha_1^2)}{1-3\alpha_1^2};$$
the same is true for the (unconditional) variance of the $e_t$s that, theoretically, ought to be
$E(e_t^2) = \alpha_0/(1 - \alpha_1 - \beta_1)$. Since the number of draws
per experiment is usually predetermined by the length of the sample path, it is important to have a
sufficiently large number of experiments to cater to these issues.
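
How much single experiments can differ is easy to see by repeating the simulation. The sketch below runs a number of independent GARCH(1, 1) paths and records the sample variance and sample kurtosis of each; the parameter values, the path length, and the number of replications are assumptions for illustration.

```matlab
% Sketch: dispersion of sample variance and kurtosis across replications.
randn('seed', 42);
a0 = 3.0e-6; a1 = 0.09; b1 = 0.84; % assumed parameter values
T  = 1000;  R = 200;               % path length and number of replications
v  = zeros(R,1);  k = zeros(R,1);
for i = 1:R
    z = randn(T,1); e = zeros(T,1);
    h = a0/(1-(a1+b1));            % start at the unconditional variance
    for t = 1:T
        e(t) = z(t)*sqrt(h);
        h    = a0 + a1*e(t)^2 + b1*h;    % variance for the next period
    end
    v(i) = mean(e.^2);                   % sample variance (true mean is zero)
    k(i) = mean(e.^4) / mean(e.^2)^2;    % sample kurtosis
end
v_theory = a0/(1-(a1+b1));
fprintf('variance: theory %.3g, range over replications %.3g to %.3g\n', ...
        v_theory, min(v), max(v));
fprintf('kurtosis: range over replications %.2f to %.2f\n', min(k), max(k));
```

Individual paths can be far from the theoretical values, while the average over all replications is much closer.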


FIGURE 8.9 Actual FTSE daily log returns for July 2004 to June 2010 (left) and two simulations based on a GARCH
model fitted on actual data (center and right; 2525 observations each; seeds fixed with randn('seed',10) and
randn('seed',50), respectively).

FIGURE 8.10 Kernel densities for cumulated returns (250 days; left panel) and prices (right panel) where daily returns
follow a GARCH(1, 1) process with $\alpha_1 \in \{0.05, 0.2, 0.35, 0.95\}$ and $\beta_1 = 0.95 - \alpha_1$ (darker lines indicate lower $\alpha_1$).


As mentioned on several occasions, returns following a GARCH process will have excess kurtosis. This holds not only per period: the cumulated returns, that is, the relative price change over the
entire sample period, are also likely to be nonnormal.

Fig. 8.10 illustrates this. Based on 50,000 simulations per parameter constellation, the left panel 
shows the distribution of cumulated log returns that are equal to the log return over the entire period: 


$$\sum_{t=1}^{T} r_t = \sum_{t=1}^{T}\log(S_t/S_{t-1}) = \sum_{t=1}^{T}\bigl(\log(S_t) - \log(S_{t-1})\bigr) = \log(S_T/S_0).$$


As was the case when we analyzed the distribution over time of a single path (Fig. 8.7), setting $\alpha_1 = 0$
blanks out any impact innovations have on the volatility process, and the normally distributed
innovations will translate into normally distributed returns. Increasing $\alpha_1$ while keeping $\alpha_1 + \beta_1$
constant will increase kurtosis and make not only the distribution of daily returns leptokurtic but
also the cumulated returns. Likewise, the distribution of terminal stock prices will move away from
a lognormal distribution (Fig. 8.10, right panel).
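
A reduced version of this experiment can be sketched as follows. It compares the sample kurtosis of 250-day cumulated returns for $\alpha_1 = 0$ and $\alpha_1 = 0.35$ with $\alpha_1 + \beta_1 = 0.95$; the value of a0, the zero drift, and the smaller number of replications are assumptions made to keep the example short.

```matlab
% Sketch: kurtosis of cumulated log returns for two GARCH(1,1) settings.
randn('seed', 42);
T = 250; R = 5000;                 % days per path, number of paths (assumed)
a0 = 1e-5;                         % assumed; unconditional daily variance a0/0.05 = 2e-4
alphas = [0 0.35];                 % a1, with b1 = 0.95 - a1
kurt = zeros(size(alphas));
for j = 1:numel(alphas)
    a1 = alphas(j);  b1 = 0.95 - a1;
    h  = (a0/0.05) * ones(R,1);    % all paths start at the unconditional variance
    cr = zeros(R,1);               % cumulated log returns (drift mu = 0)
    for t = 1:T
        e  = randn(R,1) .* sqrt(h);      % one innovation per path
        cr = cr + e;
        h  = a0 + a1*e.^2 + b1*h;        % next period's variance per path
    end
    d = cr - mean(cr);
    kurt(j) = mean(d.^4) / mean(d.^2)^2; % sample kurtosis of cumulated returns
end
disp(kurt)
```

The first entry should be close to 3 (the normal case), whereas the second is expected to be clearly larger.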


8.5.4 Selected further autoregressive volatility models 


The GARCH model was a major extension to the original ARCH model, and for most practical 
purposes, the GARCH(1, 1) seems to fit reasonably well (see, e.g., Lunde and Hansen, 2005). 
However, this has not stopped academics from creating variants that capture other stylized facts or 
the particularities of certain assets. For a very broad (but still far from exhaustive) survey, Engle 
(2002) is highly recommended. Also, some textbooks on (financial) econometrics cover the most 
prominent approaches; see, for example, Brooks (2008). 


Additional explanatory variables for returns 


In the basic GARCH version, the best guess for the next realization of the dependent variable
(usually the return) is its mean: $E(r_t) = E(\mu + z_t\sqrt{h_t}) = \mu$. In many cases, however, returns will
be driven by (or at least correlate with) other factors $f$; the usual suspects are market returns,
regional developments, industry sector performance, etc. In that case (and under the assumption of
linear dependencies), the returns could be modeled as


$$r_t = b_0 + \sum_i f_{i,t}\,b_i + e_t, \quad\text{where } e_t \sim N(0, h_t).$$
i 


When using this model for simulations, one can distinguish two situations: either the realizations
for the $f_{i,t}$s are given, or they have to be simulated as well. The former would be the case if, for
example, one wants to test different scenarios for one asset while controlling for the circumstances.
In the latter case, the factors are generated with some other simulation (including parametric models
and bootstraps), and the process of simulating $r_t$ is extended by generating the $f_{i,t}$s first.
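
A minimal sketch of the second situation is given below. The two factor series are simply drawn as i.i.d. normal placeholders; the intercept, the loadings, and the GARCH parameters are assumed values. The final regression is only a plausibility check that the loadings can be recovered.

```matlab
% Sketch: returns driven by simulated factors plus GARCH(1,1) innovations.
randn('seed', 42);
T  = 1000;
b0 = 0.0002;  b = [0.8; 0.3];      % intercept and factor loadings (assumed values)
f  = 0.01 * randn(T, 2);           % step 1: simulate the factors (placeholder: i.i.d. normal)
a0 = 3.0e-6; a1 = 0.09; b1 = 0.84; % GARCH(1,1) parameters for the residuals (assumed)
z  = randn(T,1); e = zeros(T,1); h = zeros(T,1);
h(1) = a0/(1-(a1+b1)); e(1) = z(1)*sqrt(h(1));
for t = 2:T                        % step 2: simulate the GARCH innovations
    h(t) = a0 + a1*e(t-1)^2 + b1*h(t-1);
    e(t) = z(t)*sqrt(h(t));
end
r = b0 + f*b + e;                  % step 3: combine intercept, factor part, and innovations
bhat = [ones(T,1) f] \ r;          % least-squares check of intercept and loadings
disp(bhat')
```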

As a special case in this class, one can consider processes where the returns follow an ARMA
process.


I-GARCH


Quite often, the parameters dealing with the persistence of shocks add up to something (almost)
equal to one. The Integrated GARCH model introduces this as a condition: $\sum_\ell \alpha_\ell + \sum_\ell \beta_\ell = 1$. In
the I-GARCH(1,1) incarnation, this means that, by default, $\beta_1 = 1 - \alpha_1$. As a consequence, there’s
one parameter less to estimate (and to provide when simulating). At the same time, the variance
process now is integrated and has a unit root.
For simulation purposes, function GARCHsim.m can be modified by setting Statement 12 to


h(t) = a0 + a1 * e(t-1)^2 + (1 - a1) * h(t-1);

Note that the unconditional variance used as the default starting value in Statement 8 is then no
longer defined (its denominator is zero), so h(1) has to be initialized differently, for example with
a recent variance estimate.


GARCH-M 


Empirical evidence suggests that, in riskier times, assets also pay higher returns. In this case, the
levels of returns should be directly linked to the level of risk:


$$r_t = \mu + \lambda\sqrt{h_t} + e_t \quad\text{where } e_t \sim N(0, h_t),$$


where $h_t$ follows a GARCH or ARCH process. Note that alternative versions use the variance $h_t$
instead of standard deviation $\sqrt{h_t}$ as an explanatory variable. Also, it is not uncommon to have
additional explanatory variables included in the model, much in the sense of the version with
additional explanatory variables for returns described above.


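function [r,e,h] = GARCHMsim(mu,lambda,a0,a1,b1,T,Tb)
if nargin < 7, Tb = 0; end              % no "before" periods to swing in
z = randn(T+Tb,1); e = zeros(T+Tb,1); h = zeros(T+Tb,1);
h(1) = a0/(1-(a1+b1)); e(1) = z(1) * sqrt(h(1));   % initialization as in GARCHsim.m
for t = 2:(T+Tb)
    h(t) = a0 + a1 * e(t-1)^2 + b1 * h(t-1);
    e(t) = z(t) * sqrt(h(t));
end
e(1:Tb) = []; h(1:Tb) = [];             % remove swing-in observations
% -- compute returns
r = mu + lambda * sqrt(h) + e;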


GJR-GARCH 


A stylized fact for asset returns is that market participants are usually more receptive to bad news
than to good. A negative innovation, $e_t < 0$, will have a bigger impact on the next period’s volatility
than a positive innovation. In other words, the sign of $e_t$ matters. Glosten et al. (1993) suggested an
asymmetric version modeling exactly that:


$$
\begin{aligned}
h_t &= \alpha_0 + \beta_1 h_{t-1} +
       \begin{cases}
         \alpha_1 e_{t-1}^2 & \text{if } e_{t-1} \ge 0\\
         (\alpha_1 + \phi_1)\, e_{t-1}^2 & \text{if } e_{t-1} < 0
       \end{cases}\\
    &= \alpha_0 + \beta_1 h_{t-1} + \left(\alpha_1 + \phi_1 I_{e_{t-1}<0}\right) e_{t-1}^2,
\end{aligned}
$$


where $I_{e_{t-1}<0}$ is the indicator function, returning 1 if $e_{t-1} < 0$, and 0 otherwise. $\phi_1$ increases with
the strength of these asymmetries.


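function [r,e,h] = GJRGARCHsim(mu,a0,a1,phi1,b1,T,Tb)
if nargin < 7, Tb = 0; end              % no "before" periods to swing in
z = randn(T+Tb,1); e = zeros(T+Tb,1); h = zeros(T+Tb,1);
h(1) = a0/(1-(a1+0.5*phi1+b1));         % unconditional variance for symmetric z
e(1) = z(1) * sqrt(h(1));
for t = 2:(T+Tb)
    h(t) = a0 + (a1 + phi1*(e(t-1)<0)) * e(t-1)^2 + b1 * h(t-1);
    e(t) = z(t) * sqrt(h(t));
end
e(1:Tb) = []; h(1:Tb) = []; r = mu + e; % drop swing-in periods, add drift


T-GARCH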


The Threshold GARCH model works along similar lines: positive shocks will have a different
impact than negative ones. The main twist here is that it models the standard deviation rather than
the variance:


$$\sqrt{h_t} = \alpha_0 + \left(\alpha_1^{+} + \alpha_1^{-} I_{e_{t-1}<0}\right)|e_{t-1}| + \beta_1\sqrt{h_{t-1}}.$$


Here (and for most other asymmetric models), valid parameters must ensure that the new estimate
for the standard deviation, $\sqrt{h_t}$ (or, later, for the variance, $h_t$), does not become negative.


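function [r,e,h] = TGARCHsim(mu,a0,a1p,a1m,b1,T,Tb)
if nargin < 7, Tb = 0; end              % no "before" periods to swing in
z  = randn(T+Tb,1); e = zeros(T+Tb,1);
sh = zeros(T+Tb,1);                     % conditional standard deviations, sqrt(h)
sh(1) = sqrt(a0/(1 - (a1p + 0.5*a1m + b1)));   % ad hoc starting value
e(1)  = z(1) * sh(1);
for t = 2:(T+Tb)
    sh(t) = abs(a0 + (a1p + a1m*(e(t-1)<0)) * abs(e(t-1)) + ...
                b1 * sh(t-1));          % |e(t-1)|, as in the equation above
    e(t) = z(t) * sh(t);
end
e(1:Tb) = []; sh(1:Tb) = [];            % remove swing-in observations
h = sh.^2; r = mu + e;


E-GARCH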


In the original version, the updated variance is linearly dependent on its previous value. If volatility
is already high, this might be too strong. Nelson (1991), therefore, suggests an Exponential GARCH
model to use the logs of the variances rather than their actual levels. In addition, it also caters to
asymmetric effects similar to the GJR-GARCH. The original E-GARCH model uses a generalized
error distribution; another incarnation (cf., e.g., Brooks, 2008, p. 406), reads


$$\log(h_t) = \alpha_0 + \beta_1\log(h_{t-1}) + \gamma\underbrace{\frac{e_{t-1}}{\sqrt{h_{t-1}}}}_{=z_{t-1}} + \alpha_1\left(\frac{|e_{t-1}|}{\sqrt{h_{t-1}}} - \sqrt{\frac{2}{\pi}}\right).$$


If negative innovations now increase the variance more strongly than positive ones, $\gamma$ will be negative.
Also, in a passing note, there are fewer restrictions on the parameters: in previous variants, parameters must be chosen so that they ensure only positive values of $h_t$. The E-GARCH models the logs
of the variances; negative values will then be simply translated into small variances.


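function [r,e,h] = EGARCHsim(mu,a0,a1,gamma,b1,T,Tb)
if nargin < 7, Tb = 0; end              % no "before" periods to swing in
z   = randn(T+Tb,1); e = zeros(T+Tb,1);
lnh = zeros(T+Tb,1);                    % log variances, log(h)
lnh(1) = a0/(1 - b1);                   % unconditional mean of log(h)
e(1)   = z(1) * sqrt(exp(lnh(1)));
for t = 2:(T+Tb)
    lnh(t) = a0 + b1 * lnh(t-1) + gamma * z(t-1) + ...
             a1 * (abs(z(t-1)) - sqrt(2/pi));   % z(t-1) = e(t-1)/sqrt(h(t-1))
    e(t) = z(t) * sqrt(exp(lnh(t)));
end
e(1:Tb) = []; lnh(1:Tb) = [];           % remove swing-in observations
h = exp(lnh); r = mu + e;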


N-GARCH 


Another way of modeling the relationship between innovation and impact on the new variance
estimate was introduced by Engle and Ng (1993). This Nonlinear-GARCH does not (necessarily)
check the sign of the recent shock but puts it in relation to the volatility at that point in time:


$$h_t = \alpha_0 + \alpha_1\left(e_{t-1} + \gamma\sqrt{h_{t-1}}\right)^2 + \beta_1 h_{t-1}.$$


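If the parameter $\gamma$ is equal to 0, then this model collapses into a GARCH(1, 1) model; if $\gamma < 0$,
however, negative innovations will have a larger impact, because the cross term $2\gamma e_{t-1}\sqrt{h_{t-1}}$ is then
positive for $e_{t-1} < 0$. In theory, positive values for $\gamma$ could also
occur; yet it is rare in real markets that price increases increase the volatility more than comparable
losses.


function [r,e,h] = NGARCHsim(mu,a0,a1,gamma,b1,T,Tb)
if nargin < 7, Tb = 0; end              % no "before" periods to swing in
z = randn(T+Tb,1); e = zeros(T+Tb,1); h = zeros(T+Tb,1);
h(1) = a0/(1 - (a1*(1+gamma^2) + b1));  % unconditional variance
e(1) = z(1) * sqrt(h(1));
for t = 2:(T+Tb)
    h(t) = a0 + a1 * (e(t-1) + gamma * sqrt(h(t-1)))^2 + ...
           b1 * h(t-1);
    e(t) = z(t) * sqrt(h(t));
end
e(1:Tb) = []; h(1:Tb) = [];             % remove swing-in observations
r = mu + e;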


Further extensions 


In addition to the GARCH variants discussed above, numerous further contributions exist. Typi- 
cally, they cater to different stylized facts observed for empirical data. More recent additions, for 
example, allow for nonnormal distributions in the innovations or make also the higher moments 
conditional and time-varying. Most of them do have their merits; however, other variants have also 
been counted in the class of YA-GARCH models—“yet another GARCH.” As already discussed 
in the above examples, the new extensions come with additional parameters. Sometimes, these 
new models nest previous ones, and choosing particular parameter values turns them into one of 
their cousins or predecessors, as was the case with N-GARCH. Some models, such as Augmented 
GARCH (see Bera et al., 1992) are constructed to nest many different variants. Discussing them 
here, however, would exceed the purpose of this section, and the interested reader is kindly referred 
to the literature. Creating simulations based on these models is in most cases straightforward and 
follows the same general principles as the examples provided above. 


8.6 Adaptive expectations and patterns in price processes


8.6.1 Price-earnings models


One of the central concepts in asset pricing is that the current fundamental value ought to represent
the discounted value of all future payments. In the case of stocks, future payments can occur either
as dividends (including other issued rights) or as proceeds from selling on the stock itself. Based
on expectations at time $t$ for the next period, the fundamental relationship


$$S_t = E_t\left(\frac{D_{t+1} + S_{t+1}}{1 + r_t}\right)$$


should hold, where $D_{t+1}$ and $S_{t+1}$ are the future dividend payment and the stock price, respectively,
and $r_t$ is the current discount rate for period $t$. Obviously, the same argument should hold for $S_{t+1}$
and all subsequent prices. Shifting the time indices and substituting them one after the next one
into the fundamental relationship, one gets


$$S_t = E_t\left(\sum_{\tau=1}^{\infty}\frac{D_{t+\tau}}{\prod_{j=0}^{\tau-1}(1 + r_{t+j})}\right).$$

Admittedly, it would be very brave to claim one can predict a stock’s dividend until eternity. For
reasonable discount factors, however, it is mainly the first dividends that contribute the most to the
present value. Also, for a rough guess, it is often good enough to assume that the discount rate is
constant, $r_t = r$, and that the dividends either remain constant or grow following a simple process.
Gordon (1962) famously suggested a variant where dividends grow at a constant rate, $D_t = gD_{t-1}$.
Assuming a constant discount rate $r$, the fundamental value for a stock would then be


$$S_t = \sum_{\tau=1}^{\infty}\frac{D_t\,g^{\tau}}{(1+r)^{\tau}} = D_t\,\frac{g}{1+r-g} = \frac{D_{t+1}}{1+r-g}.$$

In his honor, this model is named the Gordon growth model. Note that this and other models based
on the price-earnings ratio assume that companies pay out all their profits. If earnings are retained
and no dividend is paid, a rational bubble could result. In Eq. (8.6), this would come with $\phi_1 > 1$.
A straightforward implication of this model is that the price-dividend ratio should equal
$S_t/D_{t+1} = 1/(r - (g - 1))$. Hence, if the price-dividend ratio is, say, 20 (implying that the discount
factor exceeds the growth factor by $1 + r - g = 5\%$), the current price should be $S_t = 20D_{t+1}$. Assuming average discount rates and growth factors are constant over time is equivalent to saying that
the long-run average price-dividend ratio is constant. In that case, variability in the prices should
be down to variability in the dividends only. In particular, a 1% change in the dividend should lead
to a 1% change in the stock price.

Empirical tests, however, show that this does not hold in the real world. Barsky and DeLong 
(1993) found for S&P 500 stocks that there is excess volatility in the stock prices: for the period 
1949-1969, for example, they found that dividends, on average, grew by 0.72%, whereas stock 
prices went up by 1.6%. Their explanation for this phenomenon is that market participants have 
adaptive expectations: investors do have assumptions about the growth rate, but they update it 
gradually if the recent change in log dividends deviates from their expectations. Replacing the 
constant growth rate in the original Gordon growth model with a time-varying one that reflects this 
behavior produces price processes with excess volatility. 


8.6.2 Models with learning 


At the same time, Timmermann (1993) published a model with a slightly different approach. His 
assumption is that changes in the log dividends are normally distributed 


$$\log(D_t/D_{t-1}) = \Delta\log(D_t) = \mu + \varepsilon_t \quad\text{where } \varepsilon_t \sim N(0, \sigma^2).$$


The problem in real life is that investors cannot observe the true parameters $\mu$ and $\sigma^2$. What they
can do is to estimate them using the previous $n$ empirical observations. As time progresses, new observations emerge, and adaptive learning will take place where the updated estimates are a weighted
combination of previous estimates and realized values:


$$\hat\mu_t = \frac{n-1}{n}\,\hat\mu_{t-1} + \frac{1}{n}\,\Delta\log(D_t)$$
$$\hat\sigma_t^2 = \frac{n-1}{n}\,\hat\sigma_{t-1}^2 + \frac{1}{n}\left(\frac{n-1}{n}\bigl(\Delta\log(D_t) - \hat\mu_{t-1}\bigr)^2\right).$$


The expected growth, the next dividend, and the fundamental price, respectively, are then given by 


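$$\hat g_t = \exp\left(\hat\mu_t + \frac{\hat\sigma_t^2}{2}\right)$$
$$E_t(D_{t+1}) = \hat D_{t+1} = D_t \exp\left(\hat\mu_t + \hat\sigma_t^2/2\right)$$
$$S_t = D_t\left(\frac{\hat g_t}{1 + r - \hat g_t}\right).$$


As for the Gordon growth model, the growth factor has to be less than the discount factor, $\hat g_t < 1 + r$.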


Listing 8.9: C-FinancialSimulations/M/./Ch08/Timmermann.m 


function [S_RE,Div,g_est,S_FT] = Timmermann(T,r,mu,sigma,n)
% Timmermann.m -- version 2011-01-06
g_true  = mu + randn(T+n,1) * sigma;
mu_hat  = zeros(T+n,1); var_hat = zeros(T+n,1);
mu_hat(n+1,1)  = mean(g_true(1:n));
var_hat(n+1,1) = var(g_true(1:n));
log_div(n+1,1) = 0;
weight = (n-1)/n;
for t = (n+2):(T+n)
    mu_hat(t)  = weight * mu_hat(t-1) + (1/n) * g_true(t);
    var_hat(t) = weight * var_hat(t-1) + (1/n) * ( weight * ...
                 ( g_true(t)-mu_hat(t-1))^2);
    log_div(t) = log_div(t-1) + g_true(t);
end
% -- discard first n observations (preceding observations)
mu_hat(1:n) = []; var_hat(1:n) = []; log_div(1:n) = [];
% -- compute prices
Div   = exp(log_div);
g_est = exp(mu_hat + var_hat/2);
g_est = min(g_est, (1 + r - .0001));
% -- price under rational expectations
S_RE  = Div .* (g_est ./ (1+r-g_est));
% -- fundamental price under known true parameters
S_FT  = Div .* (exp(mu + sigma^2/2) / (1+r-exp(mu+sigma^2/2)));

FIGURE 8.11 Stock prices under rational expectations from the Timmermann (1993) model ($n = 25$, $\mu = 0.01$, $\sigma = 0.05$).
Rational expectation prices (thin gray line) and fundamental prices with known parameters (thick gray line).


Again, the factor which multiplies $D_t$ can be read as the price-dividend ratio (PDR). When
estimations are based on a small number of observations, $\hat g_t$ will be volatile. A positive shock in the
dividend will then have an enhanced effect: not only will the current dividend go up substantially,
but investors will also be fast to adapt their $\hat g_t$. In other words, their expectations about PDR will
go up, and they will be overly optimistic. Negative shocks will have the opposite effect. Fig. 8.11
illustrates this. In the long run, low $n$ will lead to overreactions compared with the fundamental
price if parameters were known. If $n$ is large, on the other hand, then the same jump in the dividend
will not be followed by a sharp change in $\hat g_t$, and the expected PDR will react rather sluggishly. The
volatility of the stock prices will then reflect the changes in the dividend, but there should be little
excess volatility in the stock price compared with fundamental prices under known parameters.
Timmermann also performs a series of Monte Carlo simulations for different parameter settings.
The results confirm that shorter lookback periods increase volatility in the stock returns.

This model has inspired a large community and has been the main ingredient in a series of 
subsequent models. Most notably, Timmermann (1996) suggested a model where dividends are 
trend-stationary: 


$$D_t = \rho D_{t-1} + \mu + \gamma t + \varepsilon_t, \quad \text{where } \varepsilon_t \sim N(0, \sigma^2).$$


Lewellen and Shanken (2002) extend this approach to an equilibrium model. They confirm that
patterns emerge in simulated data; however, they also find that these patterns are an ex post
phenomenon and do not allow out-of-sample predictions.


8.7 Historical simulation 
8.7.1 Backtesting 


When a new strategy is evaluated by using historical data to find out how well it would have
done, then this is a typical example of backtesting. The underlying assumption is that the available
data are representative and independent of what is to be tested. This ceteris paribus assumption is
particularly relevant when, for example, investment strategies are evaluated whose orders would have had a
noticeable impact on prices, which is a serious issue for algo-traders in high-frequency markets.
The general procedure is to split the data into two subsamples where one is used for calibration (if
necessary) and the remaining data for validation.

Backtesting is heavily used for risk measurement and reporting. In the presence of a sufficiently 
large number of representative observations, the reported risk figures are then simply based on 
what the risk of the given positions would have been in the past. For the case of Value-at-Risk (a 
quantile risk measure required by the Basel II and Basel III accords), one usually computes what 
the value in the past would have been given the current composition of portfolios, and then reports 
the corresponding quantile of the empirical distribution. 
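A minimal MATLAB sketch of this historical-simulation Value-at-Risk follows. The return matrix is simulated placeholder data standing in for actual history, and the position values and the quantile convention (the k-th smallest outcome with k = floor(alpha T)) are assumptions made for this illustration.

```matlab
% Sketch: historical-simulation VaR for the current portfolio composition
randn('seed', 7);
R   = 0.01 * randn(1000, 3);     % placeholder: 1000 days of returns, 3 assets
pos = [50000 30000 20000];       % current position values (assumed)
pnl = R * pos';                  % profit/loss today's positions would have made
alpha = 0.05;                    % quantile level
s = sort(pnl);                   % ascending: worst outcomes first
k = max(1, floor(alpha * numel(s)));
VaR = -s(k);                     % reported as a positive loss figure
fprintf('%.0f%% one-day VaR: %.2f\n', 100*alpha, VaR);
```

Note that the historical returns are applied to today's positions, not to the positions held at the time.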

Another common application of backtesting is index tracking. The purpose of this approach is to
hold a portfolio that behaves exactly like the benchmark index. If perfect replication is impossible
or undesirable, the tracking portfolio can be constructed so that it mimics the benchmark as well as
possible. The quality is measured by the tracking error, usually the mean squared deviation between
the portfolio’s and the benchmark’s returns.[15] Mostly, index tracking does not rely on parametric
estimations of asset and benchmark returns but uses past observations directly. The optimal portfolio
weights are then found by comparing the past benchmark realizations with how the tracking
portfolio would have performed under the chosen weights. To lower transaction costs and monitoring
requirements, it is common to limit the number of different assets (see Maringer and Oyewumi,
2007). When the number of assets to choose from is large relative to the number of available (and
usable!) observations, overfitting issues can arise. One way of overcoming this is the introduction of
additional limits on the portfolio weights (see Zhang and Maringer, 2009).
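As a rough illustration of fitting tracking weights on past observations, the MATLAB sketch below minimizes the in-sample mean squared deviation subject only to the budget constraint, and then checks the result on a validation sample. The data are simulated placeholders, and cardinality or weight limits as in the cited papers are deliberately left out; those would require a heuristic or a constrained optimizer.

```matlab
% Sketch: least-squares index tracking with a calibration/validation split
randn('seed', 3);
T = 500; k = 5;
R     = 0.01 * randn(T, k);                                 % placeholder asset returns
r_idx = R * [0.4 0.3 0.2 0.1 0]' + 0.0005 * randn(T, 1);    % placeholder index returns
cal = 1:250;  val = 251:T;             % calibration and validation samples
Rc = R(cal, :);
% minimize sum((Rc*w - r_idx(cal)).^2) subject to sum(w) = 1:
% solve the first-order (KKT) conditions as one linear system
A   = [2*(Rc'*Rc), ones(k,1); ones(1,k), 0];
rhs = [2*(Rc'*r_idx(cal)); 1];
sol = A \ rhs;
w   = sol(1:k);                        % tracking weights (short sales allowed)
TE_cal = mean((R(cal,:)*w - r_idx(cal)).^2);   % in-sample tracking error
TE_val = mean((R(val,:)*w - r_idx(val)).^2);   % out-of-sample tracking error
disp([TE_cal TE_val])
```

A validation error that is much larger than the calibration error would hint at the overfitting issue mentioned above.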

A main advantage of backtesting is that it does not require parametric distributions and can 
therefore use actual observations directly. This, however, can also be one of its main shortfalls. 
If there are not enough historical samples, one can supplement them with MC-simulated ones. 
Depending on the problem (and data) at hand, there are several approaches that can be used: 


• The further back the observations date, the higher the chance that fundamentals have changed
and that the observations are less representative of the current situation. Transforming and “correcting”
for the change can then help; examples of this are scaling to account for changes in volatility
(see page 193) or asset-specific adjustments such as accrued coupons or shifts in interest rates.

• If there are no data for the asset itself, but sufficient observations for the factors it depends on,
historical samples can be constructed. For example, bond prices can be reconstructed if a time series of the yield
to maturity is available. Likewise, in the presence of reliable pricing models and data, prices
for futures and options can be simulated. In fact, there are situations where such constructed
samples are superior to actual ones. Just think of bonds: to estimate the Value-at-Risk of an
option, one is well advised not to use raw historical observations, but prices simulated under
current market conditions. This approach leans toward bootstrapping, which will be discussed
in Section 8.7.2.

• Making (semi)parametric assumptions about the underlying processes and fitting suitable models
to available data can deliver data-generating processes from which random samples are drawn.
Section 8.5.2 provides examples of this. Note, however, that the generated data will be only as
good as the models are. In particular, subtleties in the dependence structures across many assets
can be lost.
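The second approach in the list above can be sketched in a few lines of MATLAB. The example assumes an annual-coupon bond with made-up terms and a short, made-up yield history; it prices the bond as it stands today (same coupon, same remaining maturity) under each historical yield.

```matlab
% Sketch: reconstructing bond prices from a yield-to-maturity history
ytm  = [0.030; 0.032; 0.029; 0.035];   % placeholder yield history
c = 0.04; face = 100; N_years = 5;     % coupon rate, face value, maturity (assumed)
t  = 1:N_years;                        % payment dates in years
cf = c * face * ones(1, N_years);      % coupon payments
cf(end) = cf(end) + face;              % redemption at maturity
price  = @(y) sum(cf ./ (1 + y).^t);   % present value for a flat yield y
P_hist = arrayfun(price, ytm);         % one constructed price per historical yield
disp(P_hist')
```

Because every constructed price refers to the same remaining maturity, the sample is not distorted by the bond's own aging, unlike the raw price history.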


Chapter 15 is entirely devoted to the topic of backtesting. 


8.7.2 Bootstrap 


Bootstrapping is a method to simulate new observations based on an existing sample. A key
assumption is that the data are stationary, and this is clearly not true for price series. If a stock price
follows a geometric Brownian motion, however, it will be first-order stationary since the log
differences will be i.i.d. The bootstrap should then focus on the (stationary) price changes, which can then
be translated into a new price process. The MATLAB code bootstrapPrice.m performs


15. For a general survey of index tracking approaches and how to approach them with heuristic optimization, see di Tollo
and Maringer (2009).

FIGURE 8.12 Bootstrap samples for FTSE data.


this task. Fig. 8.12 shows the original prices and returns for the FTSE for Jan 2007 to Sep 2010 
(947 days) and the bootstrap samples for different block lengths. As expected, low values for b 
break up the volatility-clustering properties; on the other hand, higher values of b maintain it but 
produce noticeable stretches of identical behavior. 


Listing 8.10: C-FinancialSimulations/M/./Ch08/bootstrapPrice.m 


function [S_bs,r_bs] = bootstrapPrice(S,n,b,S_0)
% bootstrapPrice.m -- version 2011-01-06
% returns one bootstrap price path
%   S   ... original price series
%   n   ... length of bootstrap sample
%   b   ... block length
%   S_0 ... initial price for simulation (S(T) if not provided)
if nargin < 4, S_0 = S(end,:); end

r      = diff(log(S));                 % log returns
r_bs   = bootstrap(r,n,b);             % resampled returns
r_bs_c = cumsum([log(S_0); r_bs]);     % cumulated log prices
S_bs   = exp(r_bs_c);                  % bootstrap price path
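The helper bootstrap(r,n,b) called in the listing is defined elsewhere in the book and is not shown here. The following MATLAB sketch shows what such a block bootstrap might do; it is an assumption about the helper's behavior, not its actual code. Blocks of b consecutive rows are drawn with replacement and stacked until n rows are available.

```matlab
% Sketch: block bootstrap of a return matrix (assumed behavior of the helper)
randn('seed', 1); rand('seed', 1);
r = 0.01 * randn(500, 2);              % placeholder returns for 2 assets
n = 60; b = 5;                         % sample length and block length
T = size(r, 1);
n_blocks = ceil(n / b);                % enough blocks to cover n rows
starts = ceil(rand(n_blocks, 1) * (T - b + 1));   % block starts in 1..T-b+1
idx = bsxfun(@plus, starts', (0:b-1)');           % b x n_blocks row indices
idx = idx(:);                          % stack the blocks one after another
r_bs = r(idx(1:n), :);                 % whole rows: all assets on the same day
```

Selecting whole rows keeps the cross-sectional dependence between assets, and each block keeps b days of serial dependence.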
If the return process is also nonstationary but has some memory, one can first compute the
changes in the returns, use these bootstrapped data to simulate the returns, and then finally compute
the asset prices. A typical application for this would be bond prices that often are second-order
stationary. The yield to maturity (YTM) is usually not stationary, but the changes in the YTM are.
On the downside, this quickly increases the computational load, and other effects could be lost. If the
memory in the return series is not very strong, then a block bootstrap can be an efficient and good
enough way to capture these time dependencies. This is the approach often chosen for stock prices.
As already discussed in Section 6.7.2, there is no clear guidance on how to choose the block length.
If there is little or no memory, short block lengths should do; with stronger memory, longer block
lengths might be necessary, yet at the risk of having rather identical new samples coming from
overlapping blocks. Alternatively, a data-driven, parametric bootstrap can be performed where an
(econometric) model for the (assumed) temporal patterns is built and calibrated on the data, and the
new samples are then drawn using this model (with the residuals coming either from a parametric
distribution or from the empirical distribution of the original residuals). In either case, it is important
to consider dividends, stock splits, etc., and to work either with adjusted returns or with adjusted
prices.
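The model-based variant can be illustrated with an AR(1) model for returns. The MATLAB sketch below is only one possible instance: the AR(1) specification, the simulated placeholder data, and the starting price of 100 are assumptions. The model is fitted by least squares, and new returns are generated by feeding resampled empirical residuals through the fitted model.

```matlab
% Sketch: model-based bootstrap with an AR(1) model and empirical residuals
randn('seed', 11); rand('seed', 11);
T = 1000;
r = zeros(T, 1); e = 0.01 * randn(T, 1);
for t = 2:T, r(t) = 0.3 * r(t-1) + e(t); end    % placeholder "observed" returns
% step 1: calibrate the model on the data
X    = [ones(T-1, 1) r(1:T-1)];
beta = X \ r(2:T);                     % intercept and AR coefficient
res  = r(2:T) - X * beta;              % empirical residuals
% step 2: draw residuals with replacement and run them through the model
n    = 250;
e_bs = res(ceil(rand(n, 1) * numel(res)));
r_bs = zeros(n, 1); prev = r(end);
for t = 1:n
    r_bs(t) = beta(1) + beta(2) * prev + e_bs(t);
    prev = r_bs(t);
end
S_bs = 100 * exp(cumsum(r_bs));        % simulated price path
```

Drawing e_bs from a fitted parametric distribution instead of from res gives the fully parametric version.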

As mentioned earlier, bootstraps can preserve some of the cross-temporal as well as cross- 
sectional dependencies without relying on parametric or semiparametric models. This can be useful 
when, for example, one is interested in the properties of portfolios. Consider the case of an investor 
who is particularly concerned with high kurtosis and negative skewness—a combination hinting at 
potentially large negative events. The bootstrap procedure for a bundle of assets is a straightforward 
extension from the single asset case: after randomly picking a past date, instead of taking just one 
asset’s return, we take all assets’ returns on that day, likewise for the block bootstrap. In fact, the 
bootstrap algorithm suggested above already does that. 

To illustrate the basic workings, assume that one is interested in three of the major European
stock market indices: FTSE (UK), DAX (Germany), and EuroStoxx50 (Europe). For the sake of
simplicity, we assume the observations from August 2005 to July 2010 form a representative sample
and that we have a total of 1251 days for which prices for all three indices are available.[16] When
looking at daily log returns, the investor can compute the first four moments for the indices and
find the following:


             Mean        Volatility   Skewness   Kurtosis
FTSE        -0.05e-4     0.0146       -0.092     10.35
DAX          1.83e-4     0.0156        0.162     10.56
STOXX       -1.53e-4     0.0163        0.199     10.76
Portfolio    0.20e-4     0.0149       -0.165      8.91


Excess kurtosis is not uncommon for stocks and stock indices (if returns were normally
distributed, we would expect kurtosis to be 3). Also, the volatility of the portfolio is lower than the linear
combination of the constituents’ volatilities: there is diversification despite the high correlation between the indices
(linear correlations between log returns are all around 0.9). What might be surprising, though, is
that the skewness of the portfolio is lower than that of the constituents. If negative shocks tend to
come simultaneously while positive shocks do not, then losses will not be diversified as nicely as
profits.
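The skewness effect can be reproduced with artificial data. In the MATLAB sketch below, all numbers are made up: three assets share rare negative jumps, while their positive jumps are asset-specific. The four moments are computed with anonymous functions so that no toolbox is needed.

```matlab
% Sketch: joint negative shocks lower the skewness of a portfolio
randn('seed', 9); rand('seed', 9);
T = 200000; k = 3;
common_down = -0.05 * (rand(T, 1) < 0.01);      % shocks hitting all assets
own_up      =  0.05 * (rand(T, k) < 0.01);      % asset-specific positive shocks
R  = 0.01 * randn(T, k) + repmat(common_down, 1, k) + own_up;  % simple returns
rP = mean(R, 2);                                % equally weighted portfolio
skew = @(x) mean((x - mean(x)).^3) / std(x, 1)^3;
kurt = @(x) mean((x - mean(x)).^4) / std(x, 1)^4;
X = [R rP];
stats = zeros(k+1, 4);                 % columns: mean, volatility, skewness, kurtosis
for j = 1:k+1
    x = X(:, j);
    stats(j, :) = [mean(x) std(x) skew(x) kurt(x)];
end
disp(stats)                            % last row: portfolio
```

The positive jumps are diluted in the portfolio, the joint negative ones are not, so the portfolio ends up with lower volatility but clearly negative skewness.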

All these statistics are on a daily basis. If investors have a horizon of more than one day in mind,
then they need to know the statistics of cumulated returns. For example, when large negative shocks
tend to be offset by large positive ones, then over longer horizons, the skewness will go to
zero, and so on. To compute such statistics, bootstraps can be performed. Assume that investors
have $1 to invest, are interested in an equally weighted portfolio, and consider a buy-and-hold
strategy for investment horizons of 1 day, 5 days (i.e., one week), 21 days (approx. one month), and
62 days (approx. one quarter), respectively. For an equally weighted unit portfolio, they split their


16. Data have been downloaded from finance.yahoo.com.

FIGURE 8.13 Moments of a buy-and-hold portfolio (thick gray lines) and its constituents (thin black lines; FTSE, DAX,
and EuroStoxx) over different lengths of investment horizons (x-axis: T = 1, 5, 21, 62 days; corresponding to 1 day, 1 week,
1 month, 1 quarter); based on original data August 2005 to July 2010 and 1,000,000 bootstraps. Top panel, block length
b = 1; center panel, block length b = 5; and bottom panel, block length b = T.


initial endowment of 1 into individual investments of E = [1/3 1/3 1/3]. Using recently introduced
functions, one bootstrap simulation for a horizon of T with block length b can be performed with


bootstrapPrice(S,T,b,E); 


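where S is the matrix of historical prices (one column per index). The function returns the simulated
price paths of the three constituents of the portfolio; their values at the horizon are in the last row.
To get an idea about the distribution, N such simulations have to be performed:


I_bs = zeros(N,3);
for sim = 1:N
    S_bs = bootstrapPrice(S,T,b,E);
    I_bs(sim,:) = S_bs(end,:);   % terminal values of the constituents
end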


The terminal value of the portfolio is the sum of its constituents’ terminal prices. Once all N 
simulations have been performed, 


P = I_bs * ones(3,1); 


computes exactly that. With a total initial investment of 1, the return is simply the log of the terminal
price: $r = \log(P_T/P_0) = \log(P_T/1)$:


rP = log(P); 


These returns can then be finally investigated. 

Based on 1,000,000 bootstraps, Fig. 8.13 depicts how the volatility, skewness, and kurtosis of
the three indices and an equally weighted portfolio of these three develop (returns and volatilities
scaled to per-day levels). In the top panel of Fig. 8.13, the block length is just 1 day, implying
that there are no dependencies over time. As expected, investments over longer stretches of time
have less kurtosis: while a single-day investment has a kurtosis of around 9, the kurtosis of weekly
returns (5 days) comes down to about 4, and monthly (approx. 21 days) and quarterly (approx.
62 days) returns exhibit almost no excess kurtosis. When averaging returns over longer periods, the
skewness of individual indices also converges to that of a Gaussian distribution, whereas portfolio
returns are symmetrically distributed even on a daily basis.
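For block length b = 1, the resampled daily returns are i.i.d., and the numbers quoted for the top panel can be checked against a standard result: cumulants of sums of independent variables add up. Writing $r_{1:T}$ for the sum of T daily log returns (notation introduced here for illustration), this gives

```latex
\operatorname{skew}(r_{1:T}) = \frac{\operatorname{skew}(r_1)}{\sqrt{T}},
\qquad
\operatorname{kurt}(r_{1:T}) - 3 = \frac{\operatorname{kurt}(r_1) - 3}{T}.
% With kurt(r_1) = 9, i.e., an excess kurtosis of 6:
%   T = 5  : 3 + 6/5  = 4.2
%   T = 21 : 3 + 6/21 = 3.29 (rounded)
%   T = 62 : 3 + 6/62 = 3.10 (rounded)
```

These values agree with the pattern described above: about 4 for weekly returns and almost no excess kurtosis for monthly and quarterly returns. The relation holds only under independence, which is why the panels with longer blocks look different.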
Using longer block lengths preserves some of the time-series properties. The center panel uses
a block length of 5 days (or 1 day for single-day investments). In the bottom panel, block length
corresponds to the investment horizon. Excess kurtosis still diminishes with longer block lengths,
yet at a slower pace. Skewness, however, plunges below zero. In fact, the typical monthly
investment had a skewness of approximately -1, which is far from a normal distribution. This should
be worrying since, according to usual assumptions in utility analysis, investors do not like negative
shocks.[17] At the same time, longer investments have lower volatility, implying that there are certain
serial dependencies. Note that the mean return is hardly affected. Results in the bottom panel can be
read as typical historical outcomes for an investment over the specified investment horizon.
Compared with the outcomes with shorter block length, it is apparent that ignoring these time-series
properties can lead to substantial errors of judgment.

Similar to higher-order stationary price processes, the time series of option prices can have
rather challenging properties: the first-order differences will be nonstationary (in-the-money
options behave differently than out-of-the-money ones), and the time to maturity will play an
important role. In the presence of a reliable option pricing model, the standard approach here would be to
bootstrap the underlying and compute the derivative prices according to the model. One could even
stretch this concept and introduce empirical aspects such as the volatility smile, which is often a
function of the option’s moneyness. When interested only in the derivative’s price at maturity, things
become slightly easier: simulating the underlying at maturity and then computing the option’s intrinsic
value is sufficient. Note, however, that while this allows estimating the density of the derivative’s value at
maturity, it is not useful for pricing the derivative, in particular if premature exercise is possible.
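The at-maturity case can be sketched as follows in MATLAB. The price history is simulated placeholder data, the put is struck at the current price, and a block length of 1 is used for brevity; all of these are assumptions made for this example.

```matlab
% Sketch: distribution of a put's value at maturity via a bootstrap
randn('seed', 5); rand('seed', 5);
S = 100 * exp(cumsum(0.01 * randn(1000, 1)));   % placeholder price history
r = diff(log(S));                      % daily log returns
T = 62; N = 5000;                      % days to maturity, number of bootstraps
K = S(end);                            % strike: at the money (assumed)
idx  = ceil(rand(T, N) * numel(r));    % i.i.d. resampling (block length 1)
S_T  = S(end) * exp(sum(r(idx), 1))';  % N simulated terminal prices
put_T  = max(K - S_T, 0);              % intrinsic value of the put at maturity
port_T = S_T + put_T;                  % stock plus one put: never below K
```

The vector put_T describes the payoff distribution only; turning it into a price would need a pricing model, as noted above.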


8.8 Agent-based models and complexity 


To understand the behavior of asset prices, it can be useful to understand how market participants
tick. Market microstructure models try to do exactly that. Here, starting points are models for the
individual buyers and sellers, for the environment within which they operate, and for how things are
executed over time. Very much like in the real world, the prices are then the aggregate result. These
models can exhibit complex behavior: they are (pseudo)random since their behavior is not (readily)
predictable; they are adaptive because heterogeneous agents are connected and can change their
behavior; there are patterns on an aggregate level that are not readily visible on the microlevel[18];
they tend to be self-organizing and remarkably robust, even when extreme events occur.[19] In that
sense, financial markets are complex: participants interact and adapt, prices can exhibit time-series
properties that cannot be traced back to any single individual or fundamental source, and even the
worst stock market crash has not wiped out the financial system as such.[20]

It is surprising how often simple models can provide realistic price processes. One of the first
examples is presented in Kim and Markowitz (1989), in which portfolio insurance strategies[21] have
the potential to reinforce positive and negative trends and can eventually lead to market crashes
like the one in 1987. Subsequent models have become more and more sophisticated, including,
for example, learning agents, realistic market regulations, or real-life information feeds. Similar
to flight simulators, these models can be used for testing, experimenting, and training whenever
real-life experiments are too costly or infeasible. For more on this, Tesfatsion and Judd (2006) offer
a comprehensive introduction.


17. In fact, investors considering higher moments will even be happy to accept slightly higher volatility if this helps reduce 
high kurtosis and/or increasing skewness; see Maringer (2008b). 

18. The phenomenon that the whole is more than the sum of its parts is called emergence. 

19. Miller and Page (2007) offer a good introduction to complex social systems, which also includes economic systems. 
20. As a side note, there are subtle differences between complex and chaotic systems. In (deterministic) chaotic systems,
the main relationships are usually well known and stationary, but even the smallest of variations in initial conditions or
imprecisions can lead to unpredictable behavior. In the well-known example of the butterfly effect by Edward Lorenz, the
flap of a butterfly wing can trigger an unpredictable (chaotic) sequence of events, but it will not change the natural laws, nor
will laws suddenly appear or disappear. In complex systems, “laws” can change: in a financial system, regulations can be
modified, investors change their behavior, and new types of market participants (e.g., algo-traders) can enter the stage.

21. See Section 9.1. 
To illustrate the basic workings, let’s consider a simple foreign-exchange market[22] with N
participants where each one is either a chartist or a fundamentalist. A chartist, C, expects that the
current trend in prices will continue, while a fundamentalist, F, expects the price to return to some
fundamental value $\bar{p}$:


$$E^C(\Delta p_{t+1}) = g\,\Delta p_t \quad \text{and} \quad E^F(\Delta p_{t+1}) = \nu\,(\bar{p} - p_t),$$


where $g, \nu \le 1$ are positive scaling factors. For our purpose, it is sufficient to assume that the
fundamental price is constant; in a foreign-exchange market, this could be the long-term equilibrium
exchange rate. The actual price change is then the aggregate market opinion plus some noise, that
is,


$$p_{t+1} = p_t + w_{t,F}\,E^F(\Delta p_{t+1}) + (1 - w_{t,F})\,E^C(\Delta p_{t+1}) + \varepsilon_t, \qquad (8.11)$$


where $w_{t,F}$ is the (perceived) current fraction of fundamentalists in the market and $\varepsilon_t \sim N(0, \sigma^2)$.

Agents do not invariably belong to one of the two groups but can change their type. On the
one hand, they can be converted: when two agents interact, with a probability $\delta$, the first agent will
adopt the second’s position. On the other hand, with a small probability $\epsilon$, an agent will randomly change her
type. The former mechanism produces a positive feedback loop (the larger group is likely to recruit
the remaining minority and generate herding), whereas the latter ensures that the market does not get
stuck in a single-type situation. Given suitable parameters, the combination of these two adaptation
rules ensures that in the long run, majorities will swing.

When simulating such a market, it is common to assume that there is a certain (maximum)
number of interactions per period, but to report only the end-of-period prices. Algorithm 27 provides the
main steps; a MATLAB code is given below. Fig. 8.14 depicts a typical simulation result. As can be
seen, volatility in the price changes (top panel) goes up when the number of fundamentalists is low.
This effect will be stronger when herding is more pronounced (i.e., $\delta$ increases), and/or individual
opinion changes become rare (i.e., $\epsilon$ decreases).


Listing 8.11: C-FinancialSimulations/M/./Ch08/FXagents.m 


% FXagents.m -- version 2011-01-06
% -- parameters
N_agents = 50;    % number of agents
N_days   = 500;   % number of days
N_IPD    = 20;    % number of interactions per day

P_fund   = 100;   % fundamental price
sigma_p  = 0.1;   % additional price volatility

g        = 1;     % adj. speed chartists; 0 < g <= 1
nu       = .01;   % adj. speed fundamentalists; 0 < nu <= 1
delta    = .25;   % probability of convincing
epsilon  = .01;   % random type change

% -- initial setting
N_I    = N_days * N_IPD;          % total number of interactions
isFund = rand(N_agents,1)<.5;     % type of investor
P      = nan(N_I,1);              % intra period prices; initial values
P(1:2) = P_fund + randn(2,1)*sigma_p;
w      = nan(N_I,1);              % perceived fraction of fundamentalists

% -- emergence over time
for i = 3:N_I
    a = randperm(N_agents);
    if rand < delta, % recruitment
        isFund(a(2)) = isFund(a(1));
    end

    if rand < epsilon, % individual change of opinion
        isFund(a(3)) = ~isFund(a(3));
    end;

    w(i) = mean(isFund); % perceived fraction of fundamentalists

    % expected price changes and new price
    E_change_F(i) = (P_fund - P(i-1)) * nu;
    E_change_C(i) = (P(i-1)-P(i-2)) * g;
    change = w(i) * E_change_F(i) + (1-w(i)) * E_change_C(i);
    P(i) = abs((P(i-1) + change) + randn*sigma_p);
end;

% -- extract end of period prices
t = N_IPD:N_IPD:N_I;
S = P(t);


22. The following example is heavily inspired by Kirman (1991, 1993), yet with several simplifications.


Algorithm 27 Agent-based simulation of prices in a market with chartists and fundamentalists.
1: initialize all parameters;
2: for t = 1 : (number of periods × interactions per period) do
3:     with probability $\delta$, choose two random agents and recruit
4:     with probability $\epsilon$, switch type of one randomly chosen agent
5:     for each group, compute expected changes $E^C(\Delta p_{t+1})$ and $E^F(\Delta p_{t+1})$
6:     compute new price according to Eq. (8.11)
7: end for
8: report end-of-period prices


FIGURE 8.14 Price changes (top panel) and fraction of fundamentalists in the market (bottom panel) from an agent-based
simulation (Algorithm 27). Panel titles: $\delta = 0.25$, $\epsilon = 0.01$; $\delta = 0$, $\epsilon = 0.1$.


The model lends itself to numerous extensions. To name just a few: for other asset markets,
the constant fundamental price can be replaced with some random process (or even a real-life time
series). If market participants cannot observe the market composition directly, a noisy signal can be
added to the fraction of fundamentalists. Furthermore, if participants tend to adapt their behavior
to the (assumed) majority, regardless of their own opinions, a logistic transformation of actual
to perceived weights, $w'_t = 1/(1 + \exp(-\gamma w_t))$, with $\gamma > 0$, can be used.

A general guideline for modeling is Occam’s razor: Whenever two models work equally well,
pick the simpler one. This is particularly true for agent-based models. Making the model larger by
introducing more parameters might bring it closer to reality; however, it can also bring it closer
to artifacts: calibration issues, unmanageable complexity, and the peril of overfitting
make the results hard to interpret. Also, unlike many econometric or “traditional” financial models,
agent-based models are not designed for short-term predictions or point estimations. Their strengths
lie in dealing with highly dynamic and complex situations and uncovering driving forces behind
phenomena on an aggregate level. When they are applied, it is highly recommended to run many simulations
and analyze typical patterns. This can then help to assess likelihoods for certain events and how the
system as a whole reacts to variations in the ingredients. In that respect, they could be a suitable
answer to Lucas’ critique[23] of traditional econometric models.
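Running many simulations amounts to wrapping the model in an outer loop and collecting summary statistics per run. The MATLAB sketch below repeats a compact version of the chartist/fundamentalist market from Listing 8.11; the number of runs, the shorter horizon, and the two statistics that are recorded are choices made here for illustration.

```matlab
% Sketch: repeated runs of the agent-based market, one summary row per run
randn('seed', 21); rand('seed', 21);
n_sims = 20; N_agents = 50; N_days = 250; N_IPD = 20;
P_fund = 100; sigma_p = 0.1; g = 1; nu = 0.01; delta = 0.25; epsilon = 0.01;
N_I = N_days * N_IPD;
vol = zeros(n_sims, 1);                % volatility of daily price changes
w_avg = zeros(n_sims, 1);              % average fraction of fundamentalists
for s = 1:n_sims
    isFund = rand(N_agents, 1) < 0.5;
    P = nan(N_I, 1); P(1:2) = P_fund + randn(2, 1) * sigma_p;
    w = nan(N_I, 1);
    for i = 3:N_I
        a = randperm(N_agents);
        if rand < delta,   isFund(a(2)) = isFund(a(1));  end   % recruitment
        if rand < epsilon, isFund(a(3)) = ~isFund(a(3)); end   % random switch
        w(i) = mean(isFund);
        change = w(i) * nu * (P_fund - P(i-1)) + (1 - w(i)) * g * (P(i-1) - P(i-2));
        P(i) = abs(P(i-1) + change + randn * sigma_p);
    end
    S = P(N_IPD:N_IPD:N_I);            % end-of-day prices
    vol(s) = std(diff(S));
    w_avg(s) = mean(w(3:end));
end
disp([w_avg vol])                      % inspect how volatility varies with w
```

The resulting table can then be inspected for typical patterns, for example by plotting vol against w_avg or by repeating the exercise for other values of delta and epsilon.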


23. In essence, econometrician Robert Lucas famously criticized that predictions based on historical data leave out
fundamental changes and that the outcome will not be independent of the actions taken. For financial markets, traditional
approaches are, therefore, of limited use when it comes to extreme market situations (with very few, if any, historical ceteris
paribus observations) and interventions (which, in a complex fashion, will alter the course of events). The recent financial
turmoil can be seen as an example of this.
Financial simulation at work:
9.1 Constant proportion portfolio insurance (CPPI)
9.1.1 Basic concepts 


Constant proportion portfolio insurance (CPPI) is a dynamic asset allocation strategy that aims at
providing a guaranteed minimum level of wealth G at maturity T while allowing some
participation in market profits.[1] In its simplest incarnation, products with a guaranteed payback can be
considered as a portfolio: the present value of the guaranteed payment, the floor $F_t = G e^{-r(T-t)}$, is
invested in a risk-free asset, whereas the remainder of its current value $V_t$, the cushion $C_t = V_t - F_t$,
is invested into a risky asset. This risky asset can be an option, and the product would be called
option-based portfolio insurance (OBPI; see Leland and Rubinstein, 1976). Alternatively, a stock
or an index could be used; however, since the cushion is usually rather small in proportion to the
total value, the actual return of such a combination will mainly be driven by the risk-free asset’s
yield. The CPPI strategy, therefore, suggests investing a multiple m of the cushion in the risky asset
(exposure, $E_t = mC_t$) and holding less than the floor in the safe asset, $B_t = V_t - E_t = V_t - mC_t$. If
the risky asset goes up, so will the cushion, and the exposure will be increased; if it goes down,
the lower cushion will trigger a reduction in the exposure. The mechanism is such that $V_t$ never falls
below the floor. Note that in bullish markets, the (theoretical) exposure could exceed $V_t$. We assume
that the investor may not (or does not want to) go short in the safe asset and, therefore, introduce
a ceiling value on $E_t$. Real-life products can differ: some do allow (limited) short selling of bonds,
whereas others increase the guarantee according to prespecified rules (“ratcheting”).
In the absence of transaction costs, the entire model reads as follows: 


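$$
\begin{aligned}
V_t &= F_t + C_t && \text{total value} \\
F_t &= G \exp(-r\,(T - t)) && \text{floor} \\
E_t &= \min(m C_t,\, V_t) && \text{exposure} \\
B_t &= V_t - E_t && \text{safe assets}
\end{aligned}
$$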


where r denotes the constant safe return of the bond per period, also used as the discount rate for
the floor. The MATLAB function CPPIgap.m is a simple implementation of this model for a
given price process. Fig. 9.1 illustrates this behavior for two different price processes. If m = 1, the
CPPI is a simple buy-and-hold portfolio mainly consisting of the safe asset. If m > 1, additional
exposure is built up whenever the risky asset goes up, and vice versa. For the rightmost column, the

positions are readjusted only with a low frequency, leading to overexposure when the price of the
risky asset drops; ultimately, this could result in violating the floor constraint.


1. See Perold (1986) and Black and Jones (1987).
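A one-step numerical example makes the mechanics concrete. The MATLAB sketch below applies the model equations to assumed values (G = 1000, r = 3% p.a., one year to maturity, m = 4, current value 1000) and then looks at two price drops that occur before the next readjustment; interest over the gap is ignored for simplicity.

```matlab
% Sketch: one CPPI allocation step and the effect of a price drop
G = 1000; r = 0.03; m = 4;       % guarantee, safe return p.a., multiplier
V = 1000; ttm = 1;               % current value, time to maturity in years (assumed)
F = G * exp(-r * ttm);           % floor:    approx. 970.45
C = V - F;                       % cushion:  approx.  29.55
E = min(m * C, V);               % exposure: approx. 118.22
B = V - E;                       % safe asset: approx. 881.78
drop  = [0.10 0.30];             % drops of the risky asset before readjustment
V_new = E * (1 - drop) + B;      % approx. 988.18 and 964.53
breach = V_new < F;              % floor violated only for the 30% drop
disp([F C E B]); disp(V_new); disp(breach);
```

As long as the cap on the exposure is not binding, the floor is breached exactly when the risky asset loses more than 1/m (here 25%) between two readjustments, since the loss E·d = mC·d then exceeds the cushion C.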


Listing 9.1: C-CaseStudies/M//Ch09/CPPIgap.m 


function [V C B F E] = CPPIgap(S, m, G, r_c, gap)
% CPPIgap.m version -- 2010-12-22
%   S   .. stock price series t = 0..T
%   G   .. guaranteed payback amount
%   m   .. multiplier
%   r_c .. cumulated safe return over entire horizon
%   gap .. readjustment frequency; if blank: 1 = always

if nargin < 5, gap = 1; end;

% --initial setting
T = length(S)-1;
t = 0:T;
V = zeros(T+1,1);
V(1) = G;                      % initial investment equals the guarantee
F = G*exp(-r_c*((T-t)/T))';    % floor
C = zeros(T+1,1);
B = zeros(T+1,1);
n = zeros(T+1,1); % number of risky assets

% --development over time
for tau=1:T % tau = t+1
    C(tau) = V(tau)-F(tau);

    if mod(tau-1, gap) == 0 % re-adjust now
        E(tau) = min(m * C(tau), V(tau));
        n(tau) = E(tau) / S(tau);
        B(tau) = V(tau) - E(tau);
    else
        n(tau) = n(tau-1);
        E(tau) = V(tau) - B(tau);
    end;

    B(tau+1) = B(tau)*exp(r_c/T);
    V(tau+1) = n(tau)*S(tau+1) + B(tau+1);
end


The conceptual beauty of this strategy is hampered by real-life limitations. The risky assets
are often hedge funds or other investment funds that are traded only infrequently or might be
sensitive to large volumes. The former can lead to gap risk: if the price drops substantially between
adjustments, the floor constraint can be violated and shortfalls are possible.[2] The latter can, in
extreme cases, lead to domino effects when price drops trigger sales, which lead to further price
drops.[3]

When the risky asset’s price does not follow a well-behaved geometric Brownian
motion or whenever frictionless trading is not possible, the CPPI’s terminal distribution cannot be
assessed analytically. In that case, Monte Carlo simulations can help. First, price processes are
generated; next, the corresponding CPPIs are simulated; and finally, the simulated CPPI processes
are analyzed.
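These three steps can be sketched in MATLAB as follows. To stay self-contained, the example generates geometric Brownian motion paths with assumed parameters instead of bootstrapped ones, and it re-implements the CPPI recursion in vectorized form across paths rather than calling CPPIgap.m. Unlike Listing 9.1, the exposure is also bounded below by zero, so that a breached floor does not lead to a short position.

```matlab
% Sketch: Monte Carlo analysis of a CPPI with daily and quarterly readjustment
randn('seed', 17);
N = 2000; T = 250;                     % number of paths, days to maturity
G = 1000; m = 4; r_c = 0.03;           % guarantee, multiplier, safe return
mu = 0.06; sigma = 0.25;               % assumed GBM parameters (p.a.)
dt = 1 / T;
% step 1: generate price paths, (T+1) x N
ret = (mu - sigma^2/2) * dt + sigma * sqrt(dt) * randn(T, N);
S = [100 * ones(1, N); 100 * exp(cumsum(ret, 1))];
F = G * exp(-r_c * (T - (0:T)') / T);  % floor over time
% step 2: simulate the CPPI on every path for each gap length
gaps = [1 60];
V_T = zeros(N, numel(gaps));
for j = 1:numel(gaps)
    V = G * ones(1, N); B = zeros(1, N); n = zeros(1, N);
    for tau = 1:T
        if mod(tau - 1, gaps(j)) == 0              % readjust now
            E = min(max(m * (V - F(tau)), 0), V);  % exposure within [0, V]
            n = E ./ S(tau, :);                    % units of the risky asset
            B = V - E;                             % safe asset
        end
        B = B * exp(r_c / T);
        V = n .* S(tau + 1, :) + B;
    end
    V_T(:, j) = V';
end
% step 3: analyze the terminal values
shortfall_freq = mean(V_T < G);        % one entry per gap length
disp(shortfall_freq)
```

With daily readjustment, the terminal value stays above the guarantee on every path, because no single daily return comes close to -1/m; shortfalls can appear only when the gap between readjustments is long.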


2. For a (simulation-based) analysis of gap risk, see Khuman et al. (2008); remedies are suggested in Maringer and
Ramtohul (2011).
3. Kim and Markowitz (1989) show in a simulation study how this leads to market crashes like the one in 1987; see also
Section 8.8.

FIGURE 9.1 CPPI composition for two sample price processes and different multipliers; time to maturity of 2 years;
daily readjustment (“no gap risk”) and quarterly readjustment (rightmost; “with gap risk”). Light gray: safe asset, $B_t$; dark
gray: exposure, $E_t$; white line: floor, $F_t$.
9.1.2 Bootstrap


To illustrate the workings of this strategy, assume the example of a CPPI with multiplier m = 4 
and a maturity of 1 year (250 days). The risky asset is the FTSE, and the safe return is 3%. The 
following simulations are based on the FTSE prices for the period January 4, 2005 to September 
29, 2010 downloaded from finance.yahoo.com and comprising 1491 daily prices. The vector with 
the prices will be called S. Finally, assume that the guarantee is for G = 1000. The price of the risky 
asset at t = 0 is assumed to be equal to the last one in the historical time series. Note, however, that 
this does not matter; the number of stocks held would change, but not the exposure. 

To see how the CPPI behaves over time, one can simulate its prices depending on a scenario 
for the underlying’s price process. For this purpose, a bootstrap can be used. To capture serial 
dependencies in returns, a block length of 20 days is chosen. Using previously introduced functions, 
this can be done by 


S_bs = bootstrapPrice(S, 250, 20);


Next, the price path for the CPPI in this scenario has to be computed: 


CPPI_bs = CPPIgap(S_bs, m, G, 0.03, gap);


where m and G are the multiplier and guarantee, respectively. gap is the readjustment frequency; 
1 means daily readjustment; 5, 20, and 60 represent once per week, once every fourth week, and 
once per quarter, respectively. 

If one repeats these steps several times, one can already get some idea about the CPPI’s behavior
over time. Fig. 9.2 depicts the original time series of the FTSE (left panel). The second panel
contains 15 sample trajectories for a 250-day horizon, generated with a block bootstrap with block
length b = 20. The two panels on the right exhibit the corresponding CPPI trajectories with daily
(gap = 1) and quarterly (gap = 60) readjustments. The floor is given by the dashed line. Note that
before maturity, the value of the CPPI can fall below the guarantee but should not fall underneath
the floor.

After increasing the number of samples further, one can analyze the CPPI’s payoff distribution.
The graphs in Fig. 9.3 are based on 5000 simulations. They show scatterplots for the prices of the
risky asset and the CPPI at maturity. As can be seen, the payoff relationship is similar to that of
a call option plus a safe investment (i.e., the option-based portfolio insurance, OBPI). Ideally, the
risky asset follows a continuous price process, and continuous readjustment is possible. Then, the
value of the CPPI at maturity should be above the guarantee and pay at least some return; and
even if the risky asset defaults, the investor should still receive the guarantee. Moving away from
the ideal situation undermines these properties. In particular, if the readjustment cycles get longer,
there are occasions where the value of the CPPI falls short of the guaranteed amount G (dashed
horizontal line in Fig. 9.3). Also, when looking at the results for gap lengths 5 and 20, one can spot
scatter points close to the guarantee that appear arranged on curves. What seems like an artifact is in
fact an illustration of the gap risk: if a big price drop in the risky asset occurs close to a readjustment
point, the CPPI will encounter gap risk. When the time gap until readjustment gets bigger, too many
losses will have accrued, and the CPPI will never fully recover.

FIGURE 9.2 FTSE time series, 10-block bootstraps for risky asset and trajectories for CPPI.

FIGURE 9.3 Terminal value of CPPI with FTSE as risky asset, T = 1 year, multiplier m = 4, and different gap lengths,
simulated with block bootstrap (right).

The panel on the right-hand side of Fig. 9.3 provides the kernel densities for the log returns 
of the risky asset and the CPPIs for different gaps. Obviously, the CPPI never encounters such 
massive losses as the buy-and-hold investor for the risky asset, but it also has lower profits. The
dotted vertical lines indicate the safe return. Note that the CPPI strategy has a mode below the safe
return, but positive skewness. The means are, therefore, higher in these simulations, ranging from 
2.9% to 3.15%. 


9.2 VaR estimation with Extreme Value Theory 


9.2.1 Basic concepts 


Value-at-Risk (VaR) is the maximum loss that one will not exceed with a certain probability $\alpha$
within a given time horizon. VaR has gained considerable importance as a main risk measure.
One main reason for its popularity is that it is very intuitive, and its numerical values are easier 
to interpret than other risk measures such as variance or the omega. Another reason is that it is 
sanctioned by regulators in the Basel II and Basel III accords. 

Although conceptually rather simple, estimating the VaR is often challenging. This risk measure 
focuses on rare events and, by definition, there are only few historical observations to calibrate 
models on. At the same time, parametric distributions often seem to work reasonably well for the 
mass of the distribution, but not so well for the tails. Consider the usual assumption of a normal 
distribution: 6.4 standard deviations below the mean represent a “one day since the big bang” 
quantile; the FTSE knows five such days in the 25-year horizon of October 1985 to September 
2010 alone. 
9.2.2 Scaling the data


Time-varying volatility can also be a major problem for estimating the Value-at-Risk properly. One
solution to this problem is to fit a suitable econometric model to the data and then derive the
theoretical VaR for this model. Hull and White (1998) suggest a similar approach. Their idea is to
correct the returns $r_t$ for the time-varying volatility $s_t$ and scale them such that they all have the
same (typically, the current) volatility, $s_T$:

$$\tilde{r}_t = \frac{s_T}{s_t}\, r_t .$$

The volatility $s_t$ can be estimated using, for example, a GARCH or any other suitable process.⁴


Using MATLAB’s Econometrics Toolbox, this can be done as follows:


fit = estimate(garch(1,1),r);
Var_T = fit.forecast(1);
r_tilde = sqrt(Var_T) * r./sqrt(fit.infer(r));


If an older version of the toolbox is used:


[Coeff,Errors,LLF,Innovations,Sigmas,Summary] = garchfit(r);
[SigmaForecast,MeanForecast] = garchpred(Coeff,r);
r_tilde = SigmaForecast * r./Sigmas;


Correcting for the time-varying volatility brings the data closer to being identically distributed, 
which is one of the assumptions when using Extreme Value Theory. Time-varying volatility can 
account for changes in the higher moments as well. If skewness and/or kurtosis are time varying, 
too, extended filtering techniques can be used.⁵
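Where no GARCH toolbox is at hand, the same rescaling can be tried with a simpler volatility filter. The sketch below swaps in an exponentially weighted moving average (EWMA) for the GARCH model; the smoothing parameter 0.94 and the simulated placeholder returns are assumptions for illustration, not part of the original procedure.

```matlab
% Sketch: rescaling returns to current volatility with an EWMA filter
% (EWMA replaces the GARCH fit here; lambda = 0.94 is an assumed value)
r = 0.01 * randn(1000,1);                 % placeholder for historical returns
lambda = 0.94;
h = zeros(size(r));                       % conditional variance estimates
h(1) = var(r);                            % start from the sample variance
for t = 2:length(r)
    h(t) = lambda * h(t-1) + (1 - lambda) * r(t-1)^2;
end
h_T = lambda * h(end) + (1 - lambda) * r(end)^2;   % one-step-ahead variance
r_tilde = sqrt(h_T) * r ./ sqrt(h);       % returns rescaled to current volatility
```

Each return is divided by the volatility estimated for its own day and multiplied by the forecast for the next day, mirroring the scaling equation above.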


9.2.3 Using Extreme Value Theory 
The Hill estimator 


The simplest approach for analyzing the tail of a distribution is to assume a certain functional form
and estimate the parameters such that it fits the data best. For power-law (Pareto-type) decay, this is
done via the Hill estimator,⁶

$$\hat{\xi}_m = \frac{1}{m}\sum_{i=1}^{m} \log\left(\frac{r_{(i)}}{r_{(m+1)}}\right) = \frac{1}{m}\sum_{i=1}^{m} \log\left(r_{(i)}\right) - \log\left(r_{(m+1)}\right),$$

where $r_{(i)}$ is the $i$th order statistic of the returns, that is, the $i$th smallest return. The VaR
for the $\alpha$ quantile can then be estimated by

$$\mathrm{VaR}^{\mathrm{Hill}}_{\alpha} = r_{(m+1)} \left(\frac{m/N}{\alpha}\right)^{\hat{\xi}_m} .$$

Though widely used in practice and heavily endorsed by authorities, VaR has received its fair share
of criticism because it has some undesirable properties.⁷ The conditional Value-at-Risk, cVaR, is
the expected loss encountered in case there is a shortfall, and has some theoretical advantages over
VaR. When using the Hill estimator, it can readily be computed by

$$\mathrm{cVaR}^{\mathrm{Hill}}_{\alpha} = \frac{\mathrm{VaR}^{\mathrm{Hill}}_{\alpha}}{1 - \hat{\xi}_m} .$$

4. See also the example on page 172 in which real returns are transformed into something close to white noise.

5. See Maringer and Pliota (2008a), in which time-varying threshold and filtering techniques for higher moments are
presented in the context of VaR estimation.

6. See, for example, Christoffersen (2003).

7. Most importantly, it is not a coherent risk measure (see Artzner et al., 1999), implying that risk diversification is not
measured satisfactorily and that, in an optimization framework, it is prone to overfitting and unstable portfolios; see Maringer
(2005a) for stock portfolios and Winker and Maringer (2007a) for bond portfolios.
Listing 9.2: C-CaseStudies/M/./Ch09/VaRHill.m


function [VaR ES ksi] = VaRHill(r,m,a);
% VaRHill.m -- version 2011-01-06
%   r ... historical returns
%   m ... number of largest losses used
%   a ... probability VaR is exceeded

% --adjust parameters where necessary
if m < 1; m = ceil(m*length(r)); end;   % m given as a fraction of the sample
if a > 1; a = a/100; end                % a given in percent
if a > .5; a = 1-a; end                 % a given as confidence level

% --compute Hill estimator
r_order = sort(r);
ksi = sum(log(r_order(1:m)/r_order(m+1))) / m;

% --compute the VaR and ES
VaR = r_order(m+1) * ( (m/length(r)) ./ a) .^ksi;
ES  = VaR ./ ( 1-ksi);


Further Extreme Value Theory approaches 


Apart from the Hill estimator, other popular extreme value⁸ approaches include the generalized
extreme value distribution (GEV) for the block maxima approach⁹ and the generalized Pareto
distribution (GPD) for the peaks over threshold¹⁰ case. In either of these cases, a crucial question is how to
separate the extreme events from the regular ones: the larger the blocks and the larger the threshold,
the fewer events will be considered. In particular for the GPD and the GEV, the convergence results
assume an arbitrarily large threshold. This, however, hampers real-world applications: with a limited
number of observations, a strict threshold leaves very few observations for model calibration,
and arbitrarily increasing the sample length by looking further into the past increases the peril of
including irrelevant observations or, even worse, observations from different regimes.¹¹ Modifying
the threshold (or block length or critical return $r_{(m+1)}$, respectively) so that more observations make
it into the analysis shifts the focus from the tail to the mass of the distribution, defeating the main
purpose of EVT in the first place.


Threshold choice 


To get a first idea about threshold choice, a Monte Carlo simulation can help. Let’s assume
that we know the underlying distribution, and we want to find out, for a given sample length
N, what might be a good threshold or, for the Hill estimator, what fraction m/N of the available
N observations ought to be used. To illustrate this idea, we distinguish three cases: (i) returns
are normally distributed (without loss of generality, we will assume $r \sim N(0, 1)$); (ii) returns
follow a Student t distribution with $\nu = 5$ degrees of freedom and have, therefore, a slightly heavy
tail; and (iii) data are generated via the Cornish–Fisher expansion with S = −0.3. The procedure is
then to repeatedly draw samples from the specified distributions and estimate the VaR at different
confidence levels $\alpha$ and using different thresholds $T = r_{(m+1)}$. In each of these cases, the sample
length is 1250, corresponding to 5 years’ worth of daily data.

For each of these simulations, the estimated VaR is compared with the corresponding theoretical
quantile of the distribution. If $\mathrm{VaR}^{\mathrm{Hill}}/\mathrm{VaR}^{\mathrm{theo}} = 1$, then the estimates are correct; values above one
indicate that the Hill estimator gives too conservative estimates, whereas values less than one imply
that the VaR is underestimated.
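The ratio described above can be checked in a setting where the answer is known in closed form. The following sketch draws losses with an exact Pareto tail, for which the Hill estimator is well suited, and compares the estimated VaR with the theoretical quantile. The tail index, sample size, and threshold fraction are assumptions chosen for illustration; the Hill computation is written inline so that the snippet does not depend on VaRHill.

```matlab
% Sketch: ratio VaR_Hill / VaR_theo for losses with an exact Pareto tail
% (xi, N, m, and a are illustrative assumptions)
xi = 0.25;  N = 100000;  m = 5000;  a = 0.01;
r  = -rand(N,1) .^ (-xi);                 % P(r < -x) = x^(-1/xi) for x >= 1
r_order = sort(r);                        % smallest (worst) returns first
ksi = sum(log(r_order(1:m) / r_order(m+1))) / m;   % Hill estimator
VaR_hill = r_order(m+1) * ((m/N) / a)^ksi;
VaR_theo = -a^(-xi);                      % exact a-quantile of r
ratio = VaR_hill / VaR_theo;              % close to one for a Pareto tail
```

For distributions whose tails are not Pareto-type, such as the normal, the same ratio can drift away from one, which is what the simulation study below investigates.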


8. For a general introduction to Extreme Value Theory in finance, see, for example, Embrechts et al. (2003). 

9. In the block maxima approach, the entire sample is split into k blocks of equal length, and the “worst” (i.e., lowest)
observation of each block is considered.

10. In the peaks over threshold approach, a critical value, the threshold, is introduced and the exceedances of this threshold 
are then analyzed. 

11. Maringer and Pliota (2008b) discuss empirical aspects of sample length choice for VaR estimation with EVT. 
Listing 9.3: C-CaseStudies/M/./Ch09/exVaRThreshold.m


% exVaRThreshold.m -- version 2011-01-10
N = 1250;
Nruns = 1000;
alpha = logspace(0,1,18).^2 / 1e3; % [.001 .005 .01 .025 .05 .075 .1];
mOverNSet = [ .01 .025 .05 .1];
sizemOverNSet = length(mOverNSet)+1;
mSet = ceil(mOverNSet*N);

%% ..... normal distribution
VaR_normal = nan(Nruns,length(alpha),length(mSet)+1);
for run = 1:Nruns
    r = randn(N,1);
    for j = 1:length(mSet)
        m = mSet(j);
        VaR_normal(run,:,j) = VaRHill(r,m,alpha);
    end;
    VaR_normal(run,:,end) = quantile(r,alpha);   % empirical quantile
end;

%% ..... student t distribution
VaR_student = nan(Nruns,length(alpha),length(mSet)+1);
for run = 1:Nruns
    r = tinv(rand(N,1),5);
    for j = 1:length(mSet)
        m = mSet(j);
        VaR_student(run,:,j) = VaRHill(r,m,alpha);
    end;
    VaR_student(run,:,end) = quantile(r,alpha);
end;

%% ..... via Cornish Fisher approximation
VaR_CF = nan(Nruns,length(alpha),length(mSet)+1);
for run = 1:Nruns
    r = CornishFisherSimulation(0,1,-.3,3,N);
    for j = 1:length(mSet)
        m = mSet(j);
        VaR_CF(run,:,j) = VaRHill(r,m,alpha);
    end;
    VaR_CF(run,:,end) = quantile(r,alpha);
end;


Repeating this experiment 1000 times shows that the Hill estimator’s assumption of a power-law
decay can lead to unreliable VaR estimates if the fraction of data used, m/N, differs noticeably
from the desired VaR confidence level $\alpha$; see Fig. 9.4. For other common distributions as well as
other EVT approaches, results appear similar.¹²


9.3 Option pricing 


The fair value of an option (in fact, of any asset) can be written as 


fair value = discount factor x expected payoff. 


12. Here, we demonstrate the procedures only for the Hill estimator. Maringer and Pliota (2008c), on whose findings this 
section builds, extend their analysis to GPD and to cVaR, both for artificial data from a parametric distribution and from a 
kernel density estimation fitted to real stock returns. 
FIGURE 9.4 Median of ratios $\mathrm{VaR}^{\mathrm{Hill}}/\mathrm{VaR}^{\mathrm{theo}}$ for different underlying distributions (panels) and different fractions of
data used, m/N (lines).


Let the payoff $\phi$ be a deterministic function of a vector $Y$ of state variables. This is the typical case
for options; $Y$ could, for instance, be a stock price. So we get

$$E(\phi) = \int_{-\infty}^{+\infty} \phi(Y)\, \mathrm{d}F(Y), \tag{9.1}$$

with $F$ the distribution of $Y$. To make this approach operational, we need to decide on a model
for the distribution $F$ of the underlying state variables, and we need to choose the discount factor.
Then, we need to evaluate the integral. The $\pm\infty$ is not a problem: we could either use quadrature
schemes that change the variable, or simply cut off the integral at reasonable levels. For a short-term
option, if $Y$ is a stock price that is initially at 100, it suffices to integrate from 50 to 200, say.

Now, suppose we could somehow obtain a sample $y_1, y_2, \ldots, y_N$ of $Y$. Then we could replace
Eq. (9.1) by

$$\text{estimator of } E(\phi) = \frac{1}{N}\sum_{i=1}^{N} \phi(y_i) .$$


This is the essence of the Monte Carlo (MC) approach to option pricing. In the remainder of this 
chapter, we will always use the risk-free rate r to discount the expected payoff. Thus, we only need 
to think about the dynamics of Y, that is, how we can produce the sample. 
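The two routes to the expected payoff, truncated integration and sampling, can be compared in a small experiment. The sketch below assumes a lognormal terminal price with a standard deviation of 5% for the log price and a call payoff with strike 100; these choices, like the variable names, are illustrative assumptions.

```matlab
% Sketch: expected call payoff by truncated integration and by sampling
S0 = 100;  X = 100;  mu = 0;  sd = 0.05;     % assumed lognormal model for Y
phi  = @(y) max(y - X, 0);                   % payoff function
dens = @(y) exp(-(log(y/S0) - mu).^2 / (2*sd^2)) ./ (y * sd * sqrt(2*pi));
y = linspace(50, 200, 20001);                % integrate from 50 to 200 only
E_quad = trapz(y, phi(y) .* dens(y));        % trapezoidal rule
N = 1e6;
Ysample = S0 * exp(mu + sd * randn(N,1));    % sample of Y
E_mc = mean(phi(Ysample));                   % Monte Carlo estimator
```

Both numbers estimate the undiscounted expected payoff; multiplying by the discount factor would give the fair value.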


9.3.1 Modeling prices 


In mathematical finance, the evolution of quantities like prices over time is almost always described
by stochastic differential equations (SDEs). Then, a process $S$ (a stock price for instance) is typically
characterized by

$$\mathrm{d}S_t = \underbrace{g_1(S_t,t)\,\mathrm{d}t}_{\text{deterministic drift}} + \underbrace{g_2(S_t,t)\,\mathrm{d}z_t}_{\text{random shock}} . \tag{9.2}$$

The change in $S$ in a small time interval is the sum of a deterministic drift and a random component.
For the applications here, always think of $S$ as a stock price; $g_1$ and $g_2$ are functions of $S$ and time
$t$; and $z$ is a Wiener process:

• $z_0$ is zero,

• the expected change of $z$ is zero: $E(z_{t_2} - z_{t_1}) = 0$ for $t_2 > t_1$,

• the quantity $z_{t_2} - z_{t_1}$ is normally distributed with variance equal to the size of the time step, that
is, $\mathrm{Var}(z_{t_2} - z_{t_1}) = t_2 - t_1$.




We will always work with a discretized version of this equation:

$$S_{t+\Delta t} - S_t = g_1(S_t,t)\,\Delta t + g_2(S_t,t)\,(z_{t+\Delta t} - z_t) . \tag{9.3}$$

The second and third property of the Wiener process indicate that to simulate $\mathrm{d}z$ we can draw a
standard Gaussian variate $Y \sim N(0, 1)$ and multiply it by the square root of the time step $\Delta t$.
Since we always discretize with a time step greater than zero, we write $\sqrt{\Delta t}\,Y$ instead of $\mathrm{d}z_t$.

Discretizing an SDE in general leads to discretization error (a type of truncation error). Thus, to 
obtain the actual solution of the SDE, the time step would need to go to zero, which is not possible 
on a computer. There are higher order schemes that reduce this discretization error. But we need to 
remember what our goals are. The SDE itself is only a model; modeling prices in continuous time 
is mathematically more convenient, but it is in itself an approximation of reality since prices do not 
change continuously. So, we will always work with this simple discretization scheme, called Euler 
scheme. If you want to go deeper into the numerical solution of SDEs, see Kloeden and Platen 
(1999). Iacus (2008) discusses simulation of SDEs in R. 
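A minimal sketch of the Euler scheme for a generic SDE of the form (9.2) might look as follows. The two function handles encode a geometric Brownian motion with drift 0.03 and volatility 0.2; these values, and the number of steps and paths, are assumptions made for illustration.

```matlab
% Sketch: Euler scheme for dS = g1(S,t) dt + g2(S,t) dz
g1 = @(S,t) 0.03 * S;                 % assumed drift function
g2 = @(S,t) 0.20 * S;                 % assumed shock function
S0 = 100;  tau = 1;  M = 250;  N = 10000;
dt = tau / M;
S  = S0 * ones(1,N);                  % all paths start at S0
for k = 1:M
    t = (k-1) * dt;
    Y = randn(1,N);                   % standard Gaussian variates
    S = S + g1(S,t)*dt + g2(S,t) .* sqrt(dt) .* Y;   % one Euler step per path
end
meanS = mean(S);                      % compare with S0*exp(0.03*tau)
```

Replacing the two function handles is all that is needed to simulate a different process, which is the main attraction of this simple scheme.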


Example 9.1 Barrier options 


Suppose we wanted to price barrier options in a Black-Scholes world. For concreteness, assume a
down-and-out call with European exercise. So if the underlier, modeled by a geometric Brownian motion,
penetrates the barrier from above, the option expires worthless. For this type of option, a closed-form solution
is known. It is easy to see that if we price the option with a discretized version of the SDE, we will
always overestimate the price when compared with the analytical model price: we only sample the SDE
at given points, and even if the price does not go through the barrier at these points, it could do so
between the points. We could reduce this bias by making the time step smaller and smaller. But the bias
is there only when compared with the analytic solution of a model which is not true anyway; the actual
price does not follow a geometric Brownian motion.


Option prices are functions of the underlying assets, or—in the model—functions of the SDEs. 
Since we assume that the underlier follows an SDE, so must the option price. Since the option 
payoff is a deterministic function of the underlier’s price, the stochastic process that drives the 
underlier is the same one that drives the option. Thus, Black, Scholes, and Merton came up with 
the ingenious idea to not look at the option itself, but at a portfolio of the underlier and the option. 
This portfolio is chosen such that the stochastic driver just cancels out and, hence, we are left with 
a deterministic (partial) differential equation. Solving such equations is discussed in Chapter 4. 
Here we take a different approach: we simulate a large number of paths of the underlier, that is, we 
simulate the SDE; for each path, we compute the option payoff. Averaging over these payoffs, we 
get the option price. 

As a numerical technique, MC is not very efficient as the sampling error only decreases with 
the square root of the number of samples. So to get one additional digit in numerical precision, we 
need to increase N a hundredfold. Then why would we use this approach? 


• First, computing power is getting cheaper by the day. At the same time, MC is simple, and most
flexible. Even for models with an analytic solution, MC is an ideal candidate to test implementations
(e.g., to check prices computed from a complicated analytical solution).

• The expected error of MC methods does not depend on the dimensionality of the problem. A
typical application in which this is relevant are path-dependent options: here any point along the
time path represents one dimension, so modeling a one-year option with 250 time steps means
that we have 250 dimensions (but see the discussion of effective dimension, page 215). For
problems in higher dimensions, MC is often the only feasible strategy.

• MC can easily benefit from distributed computing. MC is a prime example for computations that
are, as Cleve Moler once called them, “embarrassingly parallel.”


Next we will discuss some examples of processes. We will always work in the risk-neutral world, so
$r$ will be the risk-free interest rate, and $q$ will be the continuous payout of the asset (the dividend).
Current time, $t_0$, is zero; the option expires at $\tau$.
Arithmetic Brownian motion


Arithmetic Brownian motion was suggested by Bachelier (1900). We set $g_1 = r - q$ and $g_2 = \sqrt{v}$,
so over a time step the stock price changes by a deterministic drift term $r - q$ and a random shock
with constant variance $v$.

$$\mathrm{d}S_t = (r - q)\,\mathrm{d}t + \sqrt{v}\,\mathrm{d}z_t = (r - q)\,\mathrm{d}t + \sqrt{v}\sqrt{\Delta t}\,Y . \tag{9.4}$$

If we model $S$ with such an equation, the stock price itself will be normally distributed. Thus, $S$
can become negative, and variance is constant in units of $S$. This SDE has an explicit solution:

$$S_\tau = S_0 + (r - q)\,\tau + \sqrt{v}\,z_\tau = S_0 + (r - q)\,\Delta t + \sqrt{v}\sqrt{\Delta t}\,Y . \tag{9.5}$$

Such a solution is convenient if we just need the terminal level of $S$: we can now “jump” to the
end in one time step, $\Delta t = \tau$; we need only one $Y$. Unfortunately, such solutions do not exist for all types
of SDEs; an example that we see later on is the Heston model. And sometimes we really want the
path, for instance, for path-dependent options.


Geometric Brownian motion 


Alternatively we can assume that the returns are normally distributed, so we get:

$$\mathrm{d}S_t = (r - q)S_t\,\mathrm{d}t + \sqrt{v}\,S_t\,\mathrm{d}z_t = (r - q)S_t\,\mathrm{d}t + \sqrt{v}\,S_t\sqrt{\Delta t}\,Y . \tag{9.6}$$

All that has changed is that we have added $S_t$ to the drift and shock terms. Again we have an
analytic solution:

$$S_\tau = S_0 \exp\bigl((r - q - v/2)\,\tau + \sqrt{v}\,z_\tau\bigr) = S_0 \exp\bigl((r - q - v/2)\,\Delta t + \sqrt{v}\sqrt{\Delta t}\,Y\bigr) . \tag{9.7}$$

We see that it is often more convenient to work with the logarithm of $S$, $s = \log(S)$:

$$s_\tau = s_0 + (r - q - v/2)\,\Delta t + \sqrt{v}\sqrt{\Delta t}\,Y . \tag{9.8}$$

If we want the path, there are actually different ways in which we can simulate this equation.
The most straightforward is “time stepping” where we divide the interval $[0, \tau]$ into $M = \tau/\Delta t$
subintervals. The function pricepaths creates paths of geometric Brownian motion (GBM);
see Fig. 9.5.


Listing 9.4: C-CaseStudies/M/./Ch09/pricepaths.m


function paths = pricepaths(S,tau,r,q,v,M,N)
% pricepaths.m -- version 2010-12-08
% S   = spot
% tau = time to mat
% r   = riskfree rate
% q   = dividend yield
% v   = volatility^2
% M   = time steps
% N   = number of paths
dt = tau/M;
g1 = (r - q - v/2)*dt; g2 = sqrt(v * dt);
aux = cumsum([log(S)*ones(1,N); g1 + g2 * randn(M,N)],1);
paths = exp(aux);
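As a plausibility check, time stepping and the one-step jump to maturity should produce terminal prices with the same distribution. The following sketch compares the sample means of both approaches with the risk-neutral expectation; the parameter values are assumptions, and the stepping is written inline rather than through pricepaths so that the snippet stands alone.

```matlab
% Sketch: GBM terminal prices by time stepping (M steps) and by one jump
S0 = 100;  tau = 1;  r = 0.03;  q = 0;  v = 0.2^2;    % assumed parameters
M = 50;  N = 100000;  dt = tau / M;
g1 = (r - q - v/2) * dt;   g2 = sqrt(v * dt);
sStep = log(S0) + sum(g1 + g2 * randn(M,N), 1);        % log price after M steps
sJump = log(S0) + (r - q - v/2)*tau + sqrt(v*tau) * randn(1,N);   % single step
meanStep = mean(exp(sStep));
meanJump = mean(exp(sJump));
target   = S0 * exp((r - q) * tau);                    % risk-neutral expectation
```

The stepping variant needs M times as many random numbers, which is why the jump is preferable whenever only the terminal price matters.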
In fact, we are not forced to construct the paths of GBM in chronological order. We could, for
instance, first jump to the end (or some other prespecified time) and then construct the remaining
points conditionally on this point. This technique is called a (geometric) Brownian bridge (Iacus,
2008).

A Brownian bridge can also be generated in a vectorized way; see the following code. The function
generates a standard Brownian bridge that starts at $z_0$ at time $t_0$ and ends at $z_\tau$ at time $t_\tau$. Fig. 9.6 gives
examples. Note that with this approach, the random numbers that are used to construct the bridge
are taken sequentially. In other words, the first random variate generates the first time step, the
second variate generates the second time step, and so on. When we use so-called low-discrepancy
sequences (discussed later), we may not wish to generate random steps sequentially, but recursively.

FIGURE 9.5 20 paths of geometric Brownian motion.

FIGURE 9.6 Five paths of a geometric Brownian bridge, starting at 100 and ending at 103.


Listing 9.5: C-CaseStudies/M/./Ch09/bridge.m


function b = bridge(t0,ttau,z0,ztau,M)
% bridge.m -- version 2010-12-08
dt = (ttau-t0)/M; vt = linspace(t0,ttau,M+1)';
vz = [0; cumsum(randn(M,1) * sqrt(dt))];
b = z0 + vz - (vt - t0)/(ttau - t0) .* (vz(M + 1) - ztau + z0);
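To see how such a bridge translates into prices, the sketch below builds a standard bridge from 0 to 0 with the same formula and uses it to perturb a straight line between the log prices of 100 and 103. The volatility of 0.2 and the number of steps are assumptions for illustration.

```matlab
% Sketch: a geometric Brownian bridge from 100 to 103
t0 = 0;  ttau = 1;  M = 250;  sigma = 0.2;        % assumed inputs
dt = (ttau - t0) / M;
vt = linspace(t0, ttau, M+1)';
vz = [0; cumsum(randn(M,1) * sqrt(dt))];          % unconstrained Wiener path
frac = (vt - t0) / (ttau - t0);                   % fraction of time elapsed
b0 = vz - frac .* vz(M+1);                        % standard bridge from 0 to 0
s  = log(100) + frac * log(103/100) + sigma * b0; % log-price bridge
S  = exp(s);                                      % starts at 100, ends at 103
```

Conditional on both end points the drift of the process no longer matters, so only the volatility enters the construction.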


9.3.2 Pricing models 


Now we give some concrete examples of option pricing models. 


Black-Scholes 


Black and Scholes (1973) model the stock price $S$ under the risk-neutral measure via geometric
Brownian motion, that is,

$$\mathrm{d}S_t = (r - q)S_t\,\mathrm{d}t + \sqrt{v}\,S_t\,\mathrm{d}z_t . \tag{9.9}$$

The volatility $\sqrt{v}$ is constant. Of course, we would not need to use MC for this model; but it serves
as a benchmark. The function callBSM implements the analytic solution. Note that we write the
price here in terms of variance (volatility squared).


Listing 9.6: C-CaseStudies/M/./Ch09/callBSM.m


function call = callBSM(S,X,tau,r,q,v)
% callBSM.m -- version 2010-12-08
% S   = spot
% X   = strike
% tau = time to mat
% r   = riskfree rate
% q   = dividend yield
% v   = volatility^2
d1 = ( log(S/X) + (r - q + v/2)*tau ) / (sqrt(v*tau));
d2 = d1 - sqrt(v*tau);
call = S*exp(-q*tau)*normcdf(d1) - X*exp(-r*tau)*normcdf(d2);
The MC function callBSMMC uses the function pricepaths defined before.


Listing 9.7: C-CaseStudies/M/./Ch09/callBSMMC.m


function [call,payoff] = callBSMMC(S,X,tau,r,q,v,M,N)
% callBSMMC.m -- version 2010-12-10
% S   = spot
% X   = strike
% tau = time to maturity
% r   = riskfree rate
% q   = dividend yield
% v   = volatility^2
% M   = time steps
% N   = number of paths
S = pricepaths(S,tau,r,q,v,M,N);
payoff = max((S(end,:)-X),0);
payoff = exp(-r*tau) * payoff;
call = mean(payoff);


We return the call price, and also the vector of payoffs associated with the N paths we have generated.
This allows us to construct confidence intervals. Recall that a confidence interval is our
estimate plus and minus a multiple of the standard error, that is, the sampling error of the quantity
we just estimated. We computed the mean payoff; the standard error is then the standard deviation
of this payoff across all paths, divided by the square root of the number of paths. The numerator of
this ratio is distributed as a Gaussian (by the Central Limit Theorem); the sample variance in the
denominator follows a $\chi^2$ distribution. Hence, the ratio has a t-distribution. Practically, we can
always work with the Gaussian: N will rarely be small. So, to obtain a 95% confidence interval,
we just use ± two standard errors about our estimate. The width of this confidence interval is
determined by the sample variance of the payoff, and N. Hence to get more precise estimates, we can
either reduce the variance or increase N (or both).

We should not care too much about precise confidence intervals. Their main use is to compare 
the variability of two different estimators. This shows whether one method gives more reliable 
results than another, that is, which one has smaller confidence intervals. In fact, to meaningfully 
interpret confidence intervals, we would need to establish that our estimator is actually unbiased. 
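The construction of such an interval can be sketched in a few self-contained lines. The snippet below prices a call by jumping directly to maturity and compares the result with the analytic price; the Gaussian distribution function is written with erfc so that no toolbox is required. The parameter values and variable names are assumptions for this illustration.

```matlab
% Sketch: MC call price with an approximate 95% confidence interval
S = 100;  X = 100;  tau = 1/2;  r = 0.03;  q = 0.05;  v = 0.2^2;  N = 100000;
Phi = @(x) 0.5 * erfc(-x / sqrt(2));              % Gaussian cdf without toolbox
d1 = (log(S/X) + (r - q + v/2)*tau) / sqrt(v*tau);
d2 = d1 - sqrt(v*tau);
callExact = S*exp(-q*tau)*Phi(d1) - X*exp(-r*tau)*Phi(d2);
Send   = S * exp((r - q - v/2)*tau + sqrt(v*tau) * randn(N,1));  % one jump to tau
payoff = exp(-r*tau) * max(Send - X, 0);          % discounted payoffs
callMC = mean(payoff);
SE     = std(payoff) / sqrt(N);                   % standard error of the mean
CI     = [callMC - 2*SE, callMC + 2*SE];          % plus/minus two standard errors
```

In most runs the analytic price should lie inside CI; in roughly one run out of twenty it will not, which is exactly what a 95% interval implies.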

Example code follows. We have set M to 1, since we can directly jump to the end of the time 
path. 


Listing 9.8: C-CaseStudies/M/./Ch09/exBSM.m


% exBSM.m -- version 2010-12-08
S   = 100;
X   = 100;
tau = 1/2;
r   = 0.03;
q   = 0.05;
v   = 0.2^2;

% MC parameters
M = 1;
N = 100000;

%% analytic solution
tic, call = callBSM(S,X,tau,r,q,v); t=toc;
fprintf('The analytic solution.\ncall price %6.2f.  It took %6.2f seconds.\n', call,t)

%% MC
tic, [call, payoff] = callBSMMC(S,X,tau,r,q,v,M,N); t=toc;
SE = std(payoff)/sqrt(N);

We should obtain something like the following:


The analytic solution.
call price 5.05. It took 0.00 seconds.

MC 1 (vectorized).
call price 5.06. lower band 5.01. upper band 5.12.
width of CI 0.11. It took 0.03 seconds.


For a model that is not path-dependent, such as Black-Scholes, there is much in the function pricepaths
that we actually do not need (and, hence, should not do). pricepaths computes and stores a
matrix of size (M+1) × N; each of the N columns is one price path with M steps (plus the initial
price). This is fast in MATLAB (or R), faster than looping through the columns and rows, but will
not work for larger N and M since the matrix will not fit into memory. A more efficient way would be
to split N into smaller pieces that fit into memory, or to simulate the paths one at a time. In any case,
there is no need to use exp() on every price along the paths; the terminal price suffices.

The function callBSMMC2 is a second example. It does not use matrices but creates a single
path at a time. It is not much slower than callBSMMC, but allows much larger N. It is in fact faster
than callBSMMC for larger M (M > 100, say).


Listing 9.9: C-CaseStudies/M/./Ch09/callBSMMC2.m


function [call,Q] = callBSMMC2(S,X,tau,r,q,v,M,N)
% callBSMMC2.m -- version 2010-12-10
% S   = spot
% X   = strike
% tau = time to maturity
% r   = riskfree rate
% q   = dividend yield
% v   = volatility^2
% M   = time steps
% N   = number of paths
dt = tau/M;
g1 = (r - q - v/2)*dt; g2 = sqrt(v*dt);
sumPayoff = 0; T = 0; Q = 0;
s = log(S);
for n = 1:N
    z = g1 + g2*randn(M,1);
    z = cumsum(z)+s;
    Send = exp(z(end));
    payoff = max(Send-X,0);
    sumPayoff = payoff + sumPayoff;
    % compute variance
    if n>1
        T = T + payoff;
        Q = Q + (1/(n*(n-1))) * (n*payoff - T)^2;
    else
        T = payoff;
    end
end
call = exp(-r*tau) * (sumPayoff/N);


One-pass algorithms for variance 


The function callBSMMC2 computes the variance iteratively with a one-pass algorithm. The sample
variance of a vector $Y$ can be computed as follows:

$$\mathrm{Var}(Y) = \frac{1}{N}\sum_{i=1}^{N} (Y_i - m_Y)^2 .$$

This formula requires the mean of $Y$, computed as

$$m_Y = \frac{1}{N}\sum_{i=1}^{N} Y_i ,$$

hence, we need to pass twice through the data. We use a one-pass algorithm; see Algorithm 28
(Youngs and Cramer, 1971; Chan et al., 1983).

The advantages are that we do not need to store the results of all single paths and we could stop
the algorithm once the standard error has reached a specified size. Then the variance needs to be
computed in the loop; see Statement 7 in Algorithm 28.


Algorithm 28 One-pass algorithm for computing the variance.
1: set Q = 0                                        # Q is the sum of squares of Y − m_Y
2: set T = Y_1                                      # T is the sum of Y
3: for i = 2 to N do
4:   compute T = T + Y_i
5:   compute Q = Q + (i Y_i − T)² / (i (i − 1))
6: end for
7: compute variance Q/N
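The update rule can be verified against the familiar two-pass computation. The sketch below runs the one-pass recursion on placeholder data (an assumption for this example) and compares the result with var(Y,1), which divides by N.

```matlab
% Sketch: one-pass update of the sum of squares Q, checked against var()
Y = 10 + 3 * randn(1000,1);               % placeholder data
Q = 0;                                    % running sum of squared deviations
T = Y(1);                                 % running sum of Y
for i = 2:numel(Y)
    T = T + Y(i);
    Q = Q + (i*Y(i) - T)^2 / (i*(i-1));
end
varOnePass = Q / numel(Y);                % divisor N, as in the algorithm
varTwoPass = var(Y, 1);                   % second argument 1: divide by N
```

Because Q and T are updated with each new observation, the loop could be stopped as soon as the implied standard error is small enough.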


Fig. 9.7 gives examples for the convergence of the price. The settings were $S_0 = 50$, $X = 50$, $r =
0.05$, $\sqrt{v} = 0.30$, and $\tau = 1$. We plot the obtained price (the dashed line), the true Black-Scholes
price (7.12; the solid line), and a 95% confidence interval. We require a large N to get accurate
numbers with respect to the numerical benchmark. For N = 100,000, the confidence interval has
a width of about 15 cents, or 2% of the option price; with N = 1,000,000, it shrinks to 5 cents,
less than 1% of the option price. To gain intuition about these magnitudes, we can perturb the input
parameters. With an interest rate of 0.052 instead of 0.05, the call price under Black-Scholes would
have been about 5 cents higher; with a volatility of 0.31 instead of 0.30, the price would increase
by about 20 cents.

FIGURE 9.7 Speed of convergence for Monte Carlo pricing (horizontal axis: N). $S_0 = 50$, $X = 50$, $r = 0.05$, $\sqrt{v} = 0.30$, and $\tau = 1$. The true
Black-Scholes price is 7.12.
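These sensitivities are easy to reproduce with the analytic formula. The sketch below assumes a dividend yield of zero, which the settings above do not state, and writes the Gaussian distribution function with erfc to avoid toolbox dependencies.

```matlab
% Sketch: Black-Scholes call under small input perturbations (q = 0 assumed)
Phi = @(x) 0.5 * erfc(-x / sqrt(2));              % Gaussian cdf without toolbox
d1  = @(S,X,tau,r,q,v) (log(S/X) + (r - q + v/2)*tau) / sqrt(v*tau);
bsm = @(S,X,tau,r,q,v) S*exp(-q*tau)*Phi(d1(S,X,tau,r,q,v)) ...
      - X*exp(-r*tau)*Phi(d1(S,X,tau,r,q,v) - sqrt(v*tau));
base  = bsm(50, 50, 1, 0.05,  0, 0.30^2);         % benchmark price
dRate = bsm(50, 50, 1, 0.052, 0, 0.30^2) - base;  % effect of a higher rate
dVol  = bsm(50, 50, 1, 0.05,  0, 0.31^2) - base;  % effect of higher volatility
```

Seen this way, a Monte Carlo error of a few cents is of the same order as the uncertainty that comes from not knowing the inputs precisely.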


Variance reduction 1: antithetic variates 


The standard error of the mean estimate of some quantity is the standard deviation of the sampled 
quantity over the square root of the number of samples N. So, to decrease the error in MC methods, 
we can either increase N, or decrease the variance of the quantities that we compute from our 
sample. This latter strategy goes by the name of variance reduction (see also page 121). Variance 
reduction can often help to make simulations more efficient, though it also makes the procedures 
more complicated and slower, and more vulnerable to (possibly subtle) errors. 

For two random variables $Y_1$ and $Y_2$ with the same variance, the following holds:

$$\mathrm{Var}\left(\frac{Y_1 + Y_2}{2}\right) = \frac{1}{4}\bigl(\mathrm{Var}(Y_1) + \mathrm{Var}(Y_2) + 2\,\mathrm{Cov}(Y_1, Y_2)\bigr) .$$

So we can reduce the variance if $Y_1$ and $Y_2$ are negatively correlated. A simple and extreme example
is uniform variates. If $U$ is distributed uniformly between 0 and 1, so is $1 - U$, but the linear
correlation between the two variates is −1. Imagine we want to compute the mean, which is 0.5,
of a sample of such uniforms. We could sample N numbers, and compute $\frac{1}{N}\sum U$. Or we sample
N/2 numbers, and increase the sample to size N by computing $1 - U$ for all $U$. So we have pairs
$[U, 1 - U]$, and the mean of each pair is exactly $(U + 1 - U)/2 = 0.5$.

To benefit from antithetic (i.e., negatively correlated) variates, we need to make sure that this
“anti-correlation” propagates into the things we actually compute with the random numbers. If we
use the uniforms to generate nonuniforms, we need to be careful which method we use: the inverse
is monotone, so (at least rank) correlation remains. But with acceptance-rejection methods or
transformations this is not guaranteed. The contrary, actually: specific algorithms explicitly require
uniforms that are independent. So a safer method, if possible, is to induce negative correlation
at a higher level. If we use Gaussian variates Y, we can use −Y (which would have been just as
likely).
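
As a small check of the statement about the inverse, the following sketch transforms U and 1 − U into Gaussian variates by inversion. To stay free of toolbox functions it builds the inverse normal from `erfcinv`; the helper `invNorm` is our own name and not part of the book's code.

```matlab
% Inversion carries the antithetic pair [U, 1-U] over to [Y, -Y].
invNorm = @(u) -sqrt(2) * erfcinv(2*u);   % inverse of the standard normal CDF
U  = rand(10000,1);
Y1 = invNorm(U);                 % Gaussian variates from U
Y2 = invNorm(1 - U);             % Gaussian variates from the antithetic uniforms
maxDiff = max(abs(Y1 + Y2));     % Y2 equals -Y1 up to rounding error

% sample correlation, computed by hand
e1 = Y1 - mean(Y1); e2 = Y2 - mean(Y2);
rhoHat = sum(e1.*e2)/sqrt(sum(e1.^2)*sum(e2.^2));
fprintf('max |Y1 + Y2| = %g, correlation = %6.4f\n', maxDiff, rhoHat);
```

So for Gaussian variates obtained by inversion, flipping the sign of Y and using 1 − U amount to the same thing.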

Consider a portfolio of two identical options on underliers $S^{(1)}$ and $S^{(2)}$. Suppose these S satisfy
the same SDE, but we switch the signs of the random variates used to simulate them.

$$dS^{(1)} = r S^{(1)}\,dt + \sqrt{v}\, S^{(1)}\,dz$$
$$dS^{(2)} = r S^{(2)}\,dt - \sqrt{v}\, S^{(2)}\,dz$$

The options should have the same price (the same drift and variance term in the SDEs). But $S^{(1)}$
and $S^{(2)}$ are negatively correlated, hence the sampling variance of S is reduced, and consequently
the variance of the mean of the two options is smaller than through computing the option price from
just one equation.

The following functions callBSMMC3 and callBSMMC4 show implementations of antithetic
variates (version 3 uses loops, version 4 is vectorized).


Listing 9.10: C-CaseStudies/M/./Ch09/callBSMMC3.m

function [call,Q] = callBSMMC3(S,X,tau,r,q,v,M,N)
% callBSMMC3.m -- version 2010-12-08
% S   = spot
% X   = strike
% tau = time to maturity
% r   = riskfree rate
% q   = dividend yield
% v   = volatility^2
% M   = time steps
% N   = number of paths
dt = tau/M;
g1 = (r - q - v/2)*dt; g2 = sqrt(v*dt);
sumPayoff = 0; T = 0; Q = 0;
s = log(S);
for n = 1:N
    ee = g2 * randn(M,1);
    z = g1 + ee;
    z = cumsum(z)+s;
    Send = exp(z(end));
    payoff = max(Send-X,0);
    z = g1 - ee;
    z = cumsum(z)+s;
    Send = exp(z(end));
    payoff = payoff + max(Send-X,0);
    payoff = payoff/2;
    sumPayoff = payoff + sumPayoff;
    % compute variance
    if n>1
        T = T + payoff;
        Q = Q + (1/(n*(n-1))) * (n*payoff - T)^2;
    else
        T = payoff;
    end
end
call = exp(-r*tau) * (sumPayoff/N);


Listing 9.11: C-CaseStudies/M/./Ch09/callBSMMC4.m

function [call,payoff] = callBSMMC4(S,X,tau,r,q,v,M,N)
% callBSMMC4.m -- version 2010-12-08
% S   = spot
% X   = strike
% tau = time to maturity
% r   = riskfree rate
% q   = dividend yield
% v   = volatility^2
% M   = time steps
% N   = number of paths
dt = tau/M;
g1 = (r - q - v/2)*dt; g2 = sqrt(v*dt);
s = log(S);
ee = g2 * randn(M,N);
z = cumsum(g1+ee,1) + s; % cumsum(...,1) in case of M=1!
S = exp(z(end,:));
payoff = max(S-X,0);
z = cumsum(g1-ee,1) + s; % cumsum(...,1) in case of M=1!
S = exp(z(end,:));
payoff = payoff + max(S-X,0);
payoff = payoff/2;
payoff = exp(-r*tau) * payoff;
call = mean(payoff);

The example is not entirely fair since we use two times the number of paths; while we save the
generation of the random numbers, they do not account for most of the running time. In general,
antithetic variates do not always help much, but they are cheap.
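
The loop version accumulates T and Q path by path instead of storing all payoffs. The following sketch, which is ours and uses made-up sample values, checks that this updating rule reproduces the sum of squared deviations from the mean, so that sqrt(Q/N)/sqrt(N) is a standard error.

```matlab
% One-pass update of the sum of squared deviations, as in the loop versions.
x = max(randn(1000,1), 0);       % stand-in for payoffs (illustrative only)
N = numel(x);
T = 0; Q = 0;
for n = 1:N
    if n > 1
        T = T + x(n);                              % running sum
        Q = Q + (1/(n*(n-1))) * (n*x(n) - T)^2;    % running sum of squares
    else
        T = x(n);
    end
end
SEloop = sqrt(Q/N)/sqrt(N);      % standard error from the one-pass quantities
SEvec  = std(x,1)/sqrt(N);       % same, from the stored sample (division by N)
fprintf('SE (loop) %8.6f, SE (stored sample) %8.6f\n', SEloop, SEvec);
```

Note that `std(x)` without the second argument divides by N − 1, so for large N it differs only marginally from the loop-based value.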


Variance reduction 2: control variates 


To price a vanilla call with the MC approach, we simulate paths of the underlier; for each path $Y_i$,
we get a payoff; we average the payoffs and get our estimate of the price. More formally, we are
interested in $\zeta = E(\phi(Y))$. Suppose we have another variable, $c(Y)$, and

(i) $\phi(Y_i)$ and $c(Y_i)$ are correlated, and
(ii) we know $E(c(Y)) = \bar{c}(Y)$. Note that $\bar{c}(Y)$ will still be a function of the distribution of Y, but
we will write $\bar{c}$.

In our example of a vanilla call, c could simply be the underlier’s price. When the stock’s terminal
price is high, so is the call payoff, and vice versa; so there exists correlation. Under the lognormal
distribution of Black-Scholes, we know the expected terminal price of the underlier. For a given
path, we now compute

$$\phi^*(Y_i) = \phi(Y_i) + \beta\,\bigl(c(Y_i) - \bar{c}\bigr)$$

for a $\beta$ to be determined. The expectation of $\phi^*$ is $\zeta$ since the added term is centered about its mean.
To ensure $\operatorname{Var}(\phi^*) < \operatorname{Var}(\phi)$, we need to choose $\beta$ appropriately. The variance-minimizing value is

$$\beta = -\frac{\operatorname{Cov}(\phi(Y_i), c(Y_i))}{\operatorname{Var}(c(Y_i))}.$$

This expression should be familiar: it is the formula for the slope coefficient in a linear regression
(with a constant), yet with a minus in front of it. The function callBSMMC5 implements an
example.
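
Before turning to the loop-based listing, here is a compact vectorized sketch of the same idea for a single time step. It is our illustration under assumed inputs (spot 100, strike 100, half a year to maturity, r = 0.03, q = 0.05, volatility 0.2), with the terminal stock price as control variate and the coefficient estimated from a separate pilot sample.

```matlab
% Control variate for a vanilla call: terminal stock price as control.
S = 100; X = 100; tau = 1/2; r = 0.03; q = 0.05; v = 0.2^2;  % assumed inputs
N  = 100000;                                  % number of paths
mu = (r - q - v/2)*tau; sd = sqrt(v*tau);

% pilot sample: regress the discounted payoff on the terminal price
nT    = 2000;
pS    = S * exp(mu + sd*randn(nT,1));
pO    = exp(-r*tau) * max(pS - X, 0);
coeff = [ones(nT,1) pS] \ pO;                 % intercept and slope
beta  = -coeff(2);                            % minus the regression slope

% main sample
ST     = S * exp(mu + sd*randn(N,1));         % terminal prices
payoff = exp(-r*tau) * max(ST - X, 0);        % plain discounted payoffs
cbar   = S * exp((r - q)*tau);                % known expectation of ST
adj    = payoff + beta*(ST - cbar);           % payoffs with control variate

sePlain = std(payoff)/sqrt(N);
seCV    = std(adj)/sqrt(N);
fprintf('plain   %6.3f (SE %6.4f)\n', mean(payoff), sePlain);
fprintf('control %6.3f (SE %6.4f)\n', mean(adj),    seCV);
```

Estimating the coefficient on a pilot sample keeps it independent of the main paths, so the adjusted estimate stays unbiased; the price paid is a few extra draws.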


Listing 9.12: C-CaseStudies/M/./Ch09/callBSMMC5.m

function [call,Q] = callBSMMC5(S,X,tau,r,q,v,M,N)
% callBSMMC5.m -- version 2010-12-10
% S   = spot
% X   = strike
% tau = time to maturity
% r   = riskfree rate
% q   = dividend yield
% v   = volatility^2
% M   = time steps
% N   = number of paths
dt = tau/M;
g1 = (r - q - v/2)*dt; g2 = sqrt(v*dt);
sumPayoff = 0; T = 0; Q = 0;
s = log(S);
% determine beta
nT = 2000;
sampleS = S*exp(g1*M + sqrt(v*tau) * randn(nT,1)); % g1*M: drift over the whole horizon tau
sampleO = exp(-r*tau) * max(sampleS - X, 0);
aux = [ones(nT,1) sampleS]\sampleO;
beta = -aux(2);
expS = S*exp((r-q)*tau); % expected stock price
% run paths
for n = 1:N
    z = g1 + g2*randn(M,1);
    z = cumsum(z) + s;
    Send = exp(z(end));
    payoff = max(Send-X, 0) + beta*(Send - expS);
    sumPayoff = payoff + sumPayoff;
    % compute variance
    if n>1
        T = T + payoff;
        Q = Q + (1/(n*(n-1))) * (n*payoff - T)^2;
    else
        T = payoff;
    end
end
call = exp(-r*tau) * (sumPayoff/N);


Example calls for all functions.

Listing 9.13: C-CaseStudies/M/./Ch09/exBSM.m

% exBSM.m -- version 2010-12-08
S   = 100;
X   = 100;
tau = 1/2;
r   = 0.03;
q   = 0.05;
v   = 0.2^2;

% MC parameters
M = 1;
N = 100000;

%% analytic solution
tic, call = callBSM(S,X,tau,r,q,v); t=toc;
fprintf('The analytic solution.\ncall price %6.2f. It took %6.2f seconds.\n', call,t)

%% MC
tic, [call, payoff] = callBSMMC(S,X,tau,r,q,v,M,N); t=toc;
SE = std(payoff)/sqrt(N);
fprintf('\nMC 1 (vectorized).\ncall price %6.2f. lower band %6.2f. upper band %6.2f. width of CI %6.2f. It took %6.2f seconds.\n',...
    call, -2*SE+call, 2*SE+call, 4*SE, t)

%% pathwise
tic, [call, Q] = callBSMMC2(S,X,tau,r,q,v,M,N); t=toc;
SE = sqrt(Q/N)/sqrt(N);
fprintf('\nMC 2 (loop).\ncall price %6.2f. lower band %6.2f. upper band %6.2f. width of CI %6.2f. It took %6.2f seconds.\n',...
    call, -2*SE+call, 2*SE+call, 4*SE, t)

%% variance reduction: antithetic
tic, [call, Q] = callBSMMC3(S,X,tau,r,q,v,M,N); t=toc;
SE = sqrt(Q/N)/sqrt(N);
fprintf('\nMC 3 (loop), antithetic variates.\ncall price %6.2f. lower band %6.2f. upper band %6.2f. width of CI %6.2f. It took %6.2f seconds.\n',...
    call, -2*SE+call, 2*SE+call, 4*SE, t)

%% variance reduction: antithetic
tic, [call, payoff] = callBSMMC4(S,X,tau,r,q,v,M,N); t=toc;
SE = std(payoff)/sqrt(N);
fprintf('\nMC 4 (vectorized), antithetic variates.\ncall price %6.2f. lower band %6.2f. upper band %6.2f. width of CI %6.2f. It took %6.2f seconds.\n',...
    call, -2*SE+call, 2*SE+call, 4*SE, t)

%% variance reduction: control variate
tic, [call, Q] = callBSMMC5(S,X,tau,r,q,v,M,N); t=toc;
SE = sqrt(Q/N)/sqrt(N);
fprintf('\nMC 5 (loop), control variate.\ncall price %6.2f. lower band %6.2f. upper band %6.2f. width of CI %6.2f. It took %6.2f seconds.\n',...
    call, -2*SE+call, 2*SE+call, 4*SE, t)


We see that with antithetic and control variates, the standard errors are smaller, hence the confidence 
intervals are tighter. But at the same time the functions are also slower, and take more time to 
implement. 


The analytic solution.
call price   5.05. It took   0.00 seconds.

MC 1 (vectorized).
call price   5.05. lower band   5.00. upper band   5.10.
width of CI   0.11. It took   0.04 seconds.

MC 2 (loop).
call price   5.07. lower band   5.01. upper band   5.12.
width of CI   0.11. It took   0.19 seconds.

MC 3 (loop), antithetic variates.
call price   5.08. lower band   5.05. upper band   5.11.
width of CI   0.06. It took   0.35 seconds.

MC 4 (vectorized), antithetic variates.
call price   5.06. lower band   5.03. upper band   5.09.
width of CI   0.06. It took   0.02 seconds.

MC 5 (loop), control variate.
call price   5.05. lower band   5.03. upper band   5.08.
width of CI   0.05. It took   0.22 seconds.

The Heston model 
Heston (1993) modeled the stock price S and its variance v by the following two equations:

$$dS_t = (r - q)\, S_t\, dt + \sqrt{v_t}\, S_t\, dz_t^{(1)} \qquad (9.10)$$
$$dv_t = \kappa(\theta - v_t)\, dt + \sigma \sqrt{v_t}\, dz_t^{(2)} \qquad (9.11)$$

In this model we now have two SDEs, but they still belong to the general class of equations
described in (9.2): for the price equation we have $g_1(S, t) = (r - q) S_t$ and $g_2(S, t) = \sqrt{v_t}\, S_t$; for the
variance equation we have $g_1(v, t) = \kappa(\theta - v_t)$ and $g_2(v, t) = \sigma \sqrt{v_t}$. The Wiener processes $z^{(\cdot)}$
have correlation $\rho$. The long-run variance is $\theta$, and current variance $v_0$ reverts back to this mean
value with speed $\kappa$; $\sigma$ is the volatility of volatility. For $\sigma \to 0$, the Heston dynamics approach those
of Black-Scholes. We will come back to the Heston model in Chapter 17. For this model we cannot
(at least not easily) jump to the final time, since the volatility at each step depends on the current
variance $v_t$, which is itself random; it is state dependent. So we need the path. The function
callHestonMC prices a call under the Heston model.
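
The two Wiener processes need correlated Gaussian increments. A common way to obtain them, and the one used in the listing below, is to multiply independent variates by a Cholesky factor of the correlation matrix. The following sketch is ours and only checks that step; the value of rho is an assumption for illustration.

```matlab
% Correlated Gaussian variates via the Cholesky factor.
rho = -0.7;                      % assumed correlation
M   = 100000;                    % number of draws
R   = [1 rho; rho 1];
C   = chol(R);                   % upper triangular, C'*C equals R
ee  = randn(M,2) * C;            % rows now have correlation rho

% sample correlation, computed by hand
e1 = ee(:,1) - mean(ee(:,1)); e2 = ee(:,2) - mean(ee(:,2));
rhoHat = sum(e1.*e2)/sqrt(sum(e1.^2)*sum(e2.^2));
fprintf('target %5.2f, sample correlation %6.3f\n', rho, rhoHat);
```

Because `chol` returns an upper triangular factor, the independent variates are multiplied from the left (`randn(M,2) * C`), not from the right.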


Listing 9.14: C-CaseStudies/M/./Ch09/callHestonMC.m

function [call,Q] = callHestonMC(S,X,tau,r,q,v0,vT,rho,k,sigma,M,N)
% callHestonMC.m -- version 2011-01-08
% S     = spot
% X     = strike
% tau   = time to maturity
% r     = riskfree rate
% q     = dividend yield
% v0    = initial variance
% vT    = long run variance (theta in Heston's paper)
% rho   = correlation
% k     = speed of mean reversion (kappa in Heston's paper)
% sigma = vol of vol
% M     = time steps
% N     = number of paths
dt = tau/M; sumPayoff = 0;
C = [1 rho;rho 1]; C = chol(C);
T = 0; Q = 0;
for n = 1:N
    ee = randn(M,2);
    ee = ee * C;
    vS = log(S); vV = v0;
    for t = 1:M
        % --update stock price
        dS = (r - q - vV/2)*dt + sqrt(vV)*ee(t,1)*sqrt(dt);
        vS = vS + dS;
        % --update squared vol
        aux = ee(t,2);
        % --Euler scheme
        dV = k*(vT-vV)*dt + sigma*sqrt(vV)*aux*sqrt(dt);
        % --absorbing condition
        if vV + dV < 0
            vV = 0;
        else
            vV = vV + dV;
        end
        % --zero variance: some alternatives
        %if vV + dV < 0, dV = k*(vT-vV)*dt;end;vV = vV + dV;
        %if vV + dV <= 0, dV = k*(vT)*dt;end;vV = vV + dV;
    end
    Send = exp(vS);
    payoff = max(Send-X,0);
    sumPayoff = payoff + sumPayoff;
    %compute variance
    if n>1
        T = T + payoff;
        Q = Q + (1/(n*(n-1))) * (n*payoff - T)^2;
    else
        T = payoff;
    end
end
call = exp(-r*tau) * (sumPayoff/N);


The discretized version of the SDE for the variance may well lead to negative variance, in particular
if volatility of volatility $\sigma$ is high, and mean reversion $\kappa$ is small. Unfortunately, these are exactly
the properties we need to reproduce the volatility smile (see Chapter 17). One simple method to
simulate the model is to repair the variance if it becomes negative, for instance, by setting
it to zero (the so-called absorbing condition), or reflecting it, that is, switching the sign of the last
variate. An example call to the function is given in the next MATLAB script.
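
To make the two repairs concrete, the following sketch simulates one variance path with each of them, driven by the same variates. It is our illustration: the parameter values are assumptions, and "reflecting" is implemented in one possible reading of the description above, namely by flipping the sign of the shock whenever the step would make the variance negative.

```matlab
% Euler steps for the variance with absorbing and reflecting repairs.
k = 0.5; vT = 0.2^2; sigma = 0.6; v0 = 0.2^2;   % assumed parameters
tau = 1; M = 200; dt = tau/M;
aux  = randn(M,1);                              % common variates for both paths
vAbs = zeros(M+1,1); vAbs(1) = v0;              % absorbing
vRef = zeros(M+1,1); vRef(1) = v0;              % reflecting
nFix = 0;                                       % how often a repair was needed
for t = 1:M
    % absorbing: a negative variance is set to zero
    dV = k*(vT - vAbs(t))*dt + sigma*sqrt(vAbs(t))*aux(t)*sqrt(dt);
    vAbs(t+1) = max(vAbs(t) + dV, 0);
    % reflecting: switch the sign of the variate if the step would go negative
    drift = k*(vT - vRef(t))*dt;
    shock = sigma*sqrt(vRef(t))*aux(t)*sqrt(dt);
    if vRef(t) + drift + shock < 0
        shock = -shock;
        nFix  = nFix + 1;
    end
    vRef(t+1) = vRef(t) + drift + shock;
end
fprintf('min variance: absorbing %8.6f, reflecting %8.6f (%d repairs)\n', ...
    min(vAbs), min(vRef), nFix);
```

With k*dt below one, the variance plus drift is positive, so the flipped shock always yields a positive variance.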


Listing 9.15: C-CaseStudies/M/./Ch09/exHeston.m

% exampleHeston.m -- version 2011-01-16
S     = 100;   % spot price
q     = 0.02;  % dividend yield
r     = 0.03;  % risk-free rate
X     = 110;   % strike
tau   = 0.2;   % time to maturity

Tk = ds % mean reversion speed (kappa in paper) 
sigma = 0.6;   % vol of vol
rho   = -0.7;  % correlation
v0    = 0.2^2; % current variance
vT    = 0.2^2; % long-run variance (theta in paper)

% --solution by integration (taken from Chapter 15)
call = callHestoncf(S,X,tau,r,q,v0,vT,rho,k,sigma)

M = 200; N = 200000;
[call,Q] = callHestonMC(S,X,tau,r,q,v0,vT,rho,k,sigma,M,N);
SE = sqrt(Q/N)/sqrt(N);
[-2*SE+call call 2*SE+call 4*SE]


9.3.3 Greeks 


A straightforward idea to estimate Greeks is to use a finite difference scheme. Let $f_x$ be the partial
derivative of a function f with respect to x, then we could, for instance, use the finite difference
schemes (see also page 68)

$$f_x(x, \cdot) = \frac{f(x + h, \cdot) - f(x, \cdot)}{h}, \qquad (9.12)$$
$$f_x(x, \cdot) = \frac{f(x, \cdot) - f(x - h, \cdot)}{h}, \qquad (9.13)$$
$$f_x(x, \cdot) = \frac{f(x + h, \cdot) - f(x - h, \cdot)}{2h}. \qquad (9.14)$$

FIGURE 9.8 Forward difference for Greeks: Boxplots for Delta estimates with M = 1 and N = 100,000 for different
values of h (y-axis). Parameters are S = 100, X = 100, τ = 1, r = 0.03, q = 0, and σ = 0.2. The solid horizontal line is the
true BS Delta of 0.60; the dotted lines give Delta ± 0.05.

In Chapter 2, we showed that while mathematically h should be as small as possible, with finite
numerical precision there is a limit to how small h can become. But that discussion was about
deterministic functions. Now, f is the outcome of an MC experiment, and will thus vary substantially
compared with round-off and truncation error. Suppose the task was to compute a call option’s
Delta by running two MC simulations; for the second run we slightly increase spot, that is, we use
S + h as the current price. Taking the difference between the obtained option prices and dividing it
by h gives the forward approximation of the option’s Delta. Now, if h is very small compared with
S (e.g., S = 100 and h = 0.01), it can easily happen that in the second run, the call gets a lower
price, hence we have a negative Delta.

Let us run an example. We stay in the Black-Scholes world since here we can evaluate the
Greeks analytically, and so have a benchmark. We use S = 100, X = 100, τ = 1, r = 0.03, q =
0, and √v = 0.2. The analytic Delta for these parameters is 0.60. We compute the Delta via a
forward difference. We simulate N = 100,000 paths with S = 100, and N = 100,000 paths with
S = 100 + h with h ∈ {0.01, 0.1, 0.2, 0.5, 1, 2}. From the two option values that we get, we compute
the Delta. We repeat this whole procedure 100 times. Fig. 9.8 gives boxplots of the estimated Delta
values. The solid horizontal line is the true Delta, the dotted lines give Delta ± 0.05. The obtained
values are mostly completely off the mark, but the results become reasonable for large values of h.

But, actually, we have violated a basic rule in experimental design. Whenever we want to figure 
out the effect that one variable has, we should not change several variables at the same time. But we 
have: in the second run, we have used new random numbers. So, next, we repeat the experiment, 
but we reuse the random variates (this is often called using common random numbers). So, we 
create one path that starts from S = 100, and then create a path that starts from S = 100 + h that
uses the same random numbers. And indeed the estimates are very different once we use common 
random numbers. Results are shown in Fig. 9.9. Note also that while a greater h induces bias, this 
bias is still not terribly large. The bias disappears for smaller h. 
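
A single run of this experiment can be sketched as follows. The code is ours, not the script behind the figures; it uses one time step, the parameters given in the text, and h = 0.01, and it evaluates the analytic Delta with `erfc` to avoid toolbox functions.

```matlab
% Forward-difference Delta with fresh and with common random numbers.
S = 100; X = 100; tau = 1; r = 0.03; q = 0; v = 0.2^2;
N = 100000; h = 0.01;
g1 = (r - q - v/2)*tau; g2 = sqrt(v*tau);
% MC price of the call for spot S0, given standard Gaussian variates z
price = @(S0,z) exp(-r*tau) * mean(max(S0*exp(g1 + g2*z) - X, 0));

z1 = randn(N,1); z2 = randn(N,1);
deltaNew = (price(S+h, z2) - price(S, z1))/h;   % second run with new variates
deltaCRN = (price(S+h, z1) - price(S, z1))/h;   % second run reuses the variates

% analytic Black-Scholes Delta as benchmark
d1 = (log(S/X) + (r - q + v/2)*tau)/sqrt(v*tau);
deltaBS = exp(-q*tau) * 0.5 * erfc(-d1/sqrt(2));
fprintf('analytic %6.3f, new variates %8.3f, common variates %6.3f\n', ...
    deltaBS, deltaNew, deltaCRN);
```

With new variates the sampling error of the two prices is divided by a small h and dominates the result; with common variates it largely cancels in the difference.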

A straightforward way to reuse random numbers is to (re)set the seed before each MC run. But
we can essentially reuse the paths, so that we do not really need two or three simulations (two
for forward or backward differences, three for central differences). There are many clever ideas
on how to compute Greeks other than by finite differences; see, for instance, Glasserman (2004,
Chapter 7). In any case, the finite difference approach has the advantage that it is very simple, and
still relatively cheap if we reuse the random numbers.

FIGURE 9.9 Forward difference for Greeks: Boxplots for Delta estimates with M = 1 and N = 100,000 for different
values of h (y-axis), but using common random variates. Parameters are S = 100, X = 100, τ = 1, r = 0.03, q = 0, and
σ = 0.2. The solid horizontal line is the true BS Delta of 0.60; the dotted lines give Delta ± 0.05.


9.3.4 Quasi-Monte Carlo 


As pointed out at the beginning of this section, when we estimate an option price by MC we are 
actually computing an expectation, that is, we have to evaluate an integral. In lower dimensions 
(say, dimension one), deterministic integration schemes are much more efficient to evaluate inte- 
grals, that is, it is very likely that a deterministic quadrature rule needs fewer function evaluations 
than an MC approach to obtain the same error. In fact, one desirable property of random numbers 
is equidistribution, so random numbers drawn from a certain range should cover the whole range 
as uniformly as possible. But then, why do we not use a grid? There are two disadvantages: first, 
in higher dimensions we generally cannot use a grid since the number of grid points grows ex- 
ponentially with the number of dimensions (but generally, two or three is not higher dimensions). 
Furthermore, an advantage of MC is that we do not have to compute the number of grid points in 
advance, but could stop if some convergence criterion is satisfied (e.g., the width of a confidence 
interval) which we cannot do with a grid since there is no probabilistic interpretation. 

Nevertheless, the idea to enforce more grid-style equidistribution is the basis of so-called
quasi-Monte Carlo (see also Section 6.6.2). This is quite a misnomer, for as we pointed out in Chapter 6,
MC methods are not really random either. Anyway, it is the name by which these methods are
known. A very readable introduction to quasi-Monte Carlo methods in option pricing is the paper by
Boyle and Tan (1997). For a more formal treatment, see Niederreiter (1992).

The essence of quasi-Monte Carlo methods is to replace the usual random numbers by so-called 
low discrepancy (LD) sequences. 


Discrepancy 


Intuitively, the idea of discrepancy is the following: assume a sample of N points from some “area”
Q (which could be of any dimension). Let vol(Q) be the “size” (volume) of this area. Next, pick
any subarea from Q, and call it S. If our N points were uniformly distributed on Q, we would
expect that

$$\frac{\operatorname{vol}(S)}{\operatorname{vol}(Q)} \approx \frac{\text{number of points in } S}{N},$$

that is, the number of points in S should be proportional to the volume of S. There are actually
different formal definitions of discrepancy. Often used is the so-called star discrepancy, which is
defined as follows. Start with a hypercube of dimension p, and define a subcube S that contains the
origin and has volume vol(S). Now,

$$\text{star discrepancy} = \sup_{S \in [0,1)^p} \left| \frac{\text{number of points in } S}{N} - \operatorname{vol}(S) \right|.$$

This discrepancy should be as small as possible; so the aim is to find sequences that are uniform on
the hypercube, that is, sequences with low discrepancy (LD).
A really random sequence has expected discrepancy

$$O\!\left(\frac{\sqrt{\log\log N}}{\sqrt{N}}\right).$$

Important is the $\sqrt{N}$ term, and the fact that there is no p: the dimension is not relevant. LD
sequences on the other hand are characterized by an asymptotic discrepancy of

$$O\!\left(\frac{(\log N)^p}{N}\right).$$

So, now p appears (which is bad), but in the denominator we have N instead of $\sqrt{N}$ (which is
good).

Practically, there are a number of problems with such measures: they hold only asymptotically, 
and it is difficult to compute discrepancy for a given sequence of points. That means that the best 
way to see if a given method really improves over standard MC techniques is often experimentation. 
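
In one dimension, the star discrepancy can be computed exactly from the sorted points, which makes a small experiment possible. The sketch below is ours; it relies on the standard one-dimensional formula (the maximum over i of i/N − x(i) and x(i) − (i−1)/N for sorted points) and compares pseudo-random points with equidistant midpoints, which attain the smallest possible value 1/(2N).

```matlab
% Exact star discrepancy in one dimension: random points versus a midpoint grid.
N = 1000;
% for sorted points x(1) <= ... <= x(N), take the largest of
% i/N - x(i) and x(i) - (i-1)/N over all i
starD = @(x) max(max((1:numel(x))'/numel(x) - sort(x(:)), ...
                     sort(x(:)) - (0:numel(x)-1)'/numel(x)));
Drand = starD(rand(N,1));             % pseudo-random points
Dgrid = starD(((1:N)' - 0.5)/N);      % equidistant midpoints
fprintf('random %8.5f, grid %8.5f, 1/sqrt(N) %8.5f\n', Drand, Dgrid, 1/sqrt(N));
```

For the random points the value is of the order 1/sqrt(N); in higher dimensions no such simple formula is available, which is the practical problem mentioned above.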


Van der Corput sequences 


We only look into the simplest type of LD sequences, those named after Johannes van der Corput.
In more than one dimension, van der Corput (VDC) sequences are called Halton sequences.

Say we wish to construct the kth number (k an integer) in a VDC sequence. We first represent k
in a basis b:

$$k = (d_m \cdots d_3 d_2 d_1 d_0)_b \quad \text{with} \quad d \in \{0, 1, \ldots, b - 1\}.$$

The d are the digits; m is the number of digits that we need to represent k in base b, computed as

$$m = 1 + \lfloor \log_b(k) \rfloor.$$

In MATLAB, we cannot set the base of the logarithm; but we can use the fact that $\log_b(k) =
\log(k)/\log(b)$. In R, the log function has an optional argument base.

Next, we flip the order of the digits from left to right and add a radix point (see footnote 13) such that the
resulting number lies in (0, 1); thus we obtain the kth VDC number $H_k(b)$ in basis b,

$$H_k(b) = (0.d_0 d_1 d_2 d_3 \cdots d_m)_b.$$

Converting back to basis 10, we have our “quasi-random number.”
More compactly, we can write, for any integer k ≥ 1,

$$k = \sum_{j=0}^{m} d_j b^j \qquad (9.15)$$

and

$$H_k(b) = \sum_{j=0}^{m} d_j \frac{1}{b^{j+1}} \qquad (9.16)$$

where b is the integer base, and the digits d are functions of k and b.

13. Radix point is the formal name for the symbol that delimits the integer part of a number from its fractional part.

We need an example. Let us compute the 169th VDC number in basis 3 (i.e., k = 169 and
b = 3):

$$k = 169 = (20021)_3$$
$$H_{169}(3) = (0.12002)_3 = 1 \times 3^{-1} + 2 \times 3^{-2} + 0 \times 3^{-3} + 0 \times 3^{-4} + 2 \times 3^{-5}$$
$$= 0.33333 + 0.22222 + 0 + 0 + 0.00823 = 0.56378.$$

Or, using Eqs. (9.15) and (9.16):

$$k = 2 \times 3^4 + 0 \times 3^3 + 0 \times 3^2 + 2 \times 3^1 + 1 \times 3^0$$
$$H_{169}(3) = 2 \times 3^{-5} + 0 \times 3^{-4} + 0 \times 3^{-3} + 2 \times 3^{-2} + 1 \times 3^{-1}.$$

As a check: MATLAB provides the function dec2base that converts an integer into a character 
string that holds the digits in a given base. So the command 


dec2base(169,3) 


returns 


ans = 
20021 


The following function digits does the same, but returns the digits as a vector, which is more 
helpful for our purposes. 


Listing 9.16: C-CaseStudies/M/./Ch09/digits.m

function dd = digits(k,b)
% digits.m -- version 2010-12-08
nD = 1 + floor(log(max(k))/log(b)); % required digits
dd = zeros(length(k),nD);
for i = nD:-1:1
    dd(:,i) = mod(k,b);
    if i>1; k = fix(k/b); end
end


Let us use the example a last time: the following table demonstrates step-by-step how digits
works. We first calculate the number of digits that we need (five in this case), and then
work backward through the digits: we divide by b, set the digit equal to the remainder, and set
k = floor(k/b). (To build intuition, try the trivial example digits(169,10).)

j    k     remainder after division by b
0    169   1    % the last digit
1    56    2
2    18    0
3    6     0
4    2     2

Checking Eq. (9.15), the following command should give 169 (after setting b = 3):

digits(169,3) * (b.^(4:-1:0))'  % taking the scalar product

It is now only a short step to a VDC number; see Eq. (9.16). Entering

sum(digits(169,3) ./ b.^(5:-1:1))

gives the VDC number mapped into (0, 1). The function VDC does just that.

Listing 9.17: C-CaseStudies/M/./Ch09/VDC.m

function vv = VDC(k,b)
% VDC.m -- version 2010-12-08
nD = 1 + floor(log(max(k))/log(b)); % required digits
nN = length(k);                     % number of VDC numbers
vv = zeros(nN,nD);
for i = nD:-1:1
    vv(:,i) = mod(k,b);
    if i>1; k = fix(k/b); end
end
ex = b .^ (nD:-1:1);
vv = vv ./ (ones(nN,1)*ex);
vv = sum(vv,2);


The argument k of this function can be a vector: 


VDC(1:100,7)   % generates first 100 VDC numbers in base 7
VDC(1:2:100,7) % generates VDC numbers 1, 3, 5 ... in base 7


Table 9.1 gives the first ten VDC numbers in base 2. 

There are ingenious approaches to accelerate these computations by updating from the last 
number (since they require loops, they need not be faster in MATLAB, though they will usually be 
memory saving); the VDC sequence can in fact be updated without explicitly computing the base b 
expansion in every iteration. See Glasserman (2004) for examples and references. 

If we want VDC sequences in more dimensions (Halton sequences), we just generate sequences 
with different bases b that are coprime (i.e., have no common divisor except 1). Thus, a natural 
choice is to pick prime numbers as bases. Table 9.2 lists some prime numbers, just in case. 
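
As a hedged illustration of both remarks, the following sketch builds Halton points without storing the matrix of digits: each pass of the inner loop strips the last digit of every index and adds it behind the radix point. This is a plain loop written by us, not the updating scheme discussed in Glasserman (2004), and the bases 2 and 3 are chosen only for the example.

```matlab
% Halton points (bases 2 and 3) without storing the matrix of digits.
k     = (1:169)';                 % indices of the points
bases = [2 3];                    % coprime bases, one per dimension
H     = zeros(numel(k), numel(bases));
for j = 1:numel(bases)
    b  = bases(j);
    kk = k;                       % digits not yet processed
    f  = 1/b;                     % weight of the next digit after the radix point
    while any(kk > 0)
        H(:,j) = H(:,j) + f * mod(kk, b);   % place the last digit of kk
        kk = floor(kk/b);                   % drop that digit
        f  = f/b;
    end
end
fprintf('first three numbers in base 2: %6.4f %6.4f %6.4f\n', H(1:3,1));
fprintf('169th number in base 3: %7.5f\n', H(169,2));
```

The second column reproduces the worked example (the 169th number in base 3), and each column on its own is a VDC sequence.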

Smaller bases are generally preferred since for a fixed number of steps the VDC numbers are 
more uniform; in a higher base we need more numbers to really “fill” our range. An example: the 
VDC numbers 1 to 20 in base 59 lie in the range (0.017, 0.339) and not in (0, 1). Fig. 9.10 gives 
some examples. This example points to a property of VDC numbers: they consist of monotone sub- 
sequences of length b (Glasserman, 2004, p. 294). For instance, in base 59, any sequence 1, 2,... 
will consist of 59 increasing points. 

We can use a VDC sequence to price a BS call. We create uniforms with the function VDC and 
transform these into Gaussian variates with the inverse (we use MATLAB’s norminv function). 
Convergence is shown in Fig. 9.11 (cf. Fig. 9.7). 


TABLE 9.1 The first 10 VDC numbers in base 2.

 #    in base 2    mapped to (0, 1)
 1    (0001)_2     0.5000
 2    (0010)_2     0.2500
 3    (0011)_2     0.7500
 4    (0100)_2     0.1250
 5    (0101)_2     0.6250
 6    (0110)_2     0.3750
 7    (0111)_2     0.8750
 8    (1000)_2     0.0625
 9    (1001)_2     0.5625
10    (1010)_2     0.3125

TABLE 9.2 Prime numbers smaller than 1000.

  2    3    5    7   11   13   17   19   23   29   31   37   41   43
 47   53   59   61   67   71   73   79   83   89   97  101  103  107
109  113  127  131  137  139  149  151  157  163  167  173  179  181
191  193  197  199  211  223  227  229  233  239  241  251  257  263
269  271  277  281  283  293  307  311  313  317  331  337  347  349
353  359  367  373  379  383  389  397  401  409  419  421  431  433
439  443  449  457  461  463  467  479  487  491  499  503  509  521
523  541  547  557  563  569  571  577  587  593  599  601  607  613
617  619  631  641  643  647  653  659  661  673  677  683  691  701
709  719  727  733  739  743  751  757  761  769  773  787  797  809
811  821  823  827  829  839  853  857  859  863  877  881  883  887
907  911  919  929  937  941  947  953  967  971  977  983  991  997

FIGURE 9.10 Top left: scatter of 500 against 500 points generated with MATLAB’s rand. Top right: scatter of 500 against
500 generated with Halton sequences (bases 2 and 7). Bottom left: scatter of 500 against 500 generated with Halton
sequences (bases 23 and 29). Bottom right: scatter of 500 against 500 generated with Halton sequences (bases 59 and 89).


Listing 9.18: C-CaseStudies/M/./Ch09/callBSMQMC.m

function call = callBSMQMC(S,K,tau,r,q,v,N)
% callBSMQMC.m -- version 2010-12-08
% S   = spot
% K   = strike
% tau = time to mat
% r   = riskfree rate
% q   = dividend yield
% v   = volatility^2
g1 = (r - q - v/2)*tau; g2 = sqrt(v*tau);
U = VDC(1:N,7);
ee = g2 * norminv(U);
z = log(S) + g1 + ee;
S = exp(z);
payoff = exp(-r*tau) * max(S-K,0);
call = sum(payoff)/N;

FIGURE 9.11 Convergence of price with quasi-MC (function callBSMQMC). The light gray lines show the convergence
with standard MC approach.


Dimensionality 


In the example, we priced the European option by just sampling $S_\tau$; the dimensionality is one. If
we price an option whose payoff depends on the terminal prices of p underliers, the dimensionality
is p. Now, suppose we create paths for one stock S with M time steps. We have a dimension of M
because we sample from $S_1$, $S_2$, and so on. We cannot use a sequence in a single base to generate
a path. This is easy to “prove”: recall that a VDC sequence in base b consists of monotonically increasing
subsequences of length b. Suppose we transform the uniforms with the inverse of the Gaussian,
which preserves monotonicity. The commands

plot(cumprod(1+norminv(VDC(1:1000,3))/100),'k'), hold on
plot(cumprod(1+norminv(VDC(1:1000,13))/100),'k')
plot(cumprod(1+norminv(VDC(1:1000,31))/100),'k')

result in Fig. 9.12.

However, there is the notion of “effective dimension.” In the case in which we have p terminal
prices, but all prices are highly correlated, the effective dimension is smaller than p (Tan and
Boyle, 1997). Likewise, for a sampled path the effective dimensionality is not M. Assume we have
sampled a path up to step M − 1, and suppose M is large. Then, clearly, the final step makes little
difference. In other words, the early random variates will be more important. Every stock price
after time step 1 will contain the shock of step 1; but the final variate will have influence only in
the last step. Therefore, several authors have suggested using few, large time steps generated with
LD sequences, that is, creating a “coarse” time path, and then building the remaining points by
a Brownian bridge. For an implementation, see Brandimarte (2006, Chapter 8). Note that for the
coarse time path we could not use the function bridge given earlier in this section since it uses
random numbers sequentially.

FIGURE 9.12 Three paths generated with bases 3, 13, and 31.

In conclusion, LD sequences are a way to speed up convergence; at the same time they offer 
scope for subtle errors (i.e., where outcomes look reasonable but are faulty). In Fig. 9.12 the error 
was easy to spot—once we had plotted the paths. It may not always be so obvious. Thus, extensive 
testing is required. One of the advantages of MC methods was their simplicity; this feature is not 
necessarily shared by quasi-Monte Carlo methods.

Optimization is essentially a practical tool and one principally used by non-mathematicians; in contrast,
most research papers in optimization are written in a style that is only intelligible to a mathematician. 


(Murray, 1972, p. vii) 


In this chapter, we discuss financial optimization models in general: how models are set up, how 
they are solved, and how obtained solutions are evaluated. The chapter concludes with several 
examples of financial optimization models, some of which will be revisited in later chapters. Fol- 
lowing William Murray’s quotation, all descriptions will be informal. 


10.1 What to optimize? 


An optimization model consists of an objective function (also called optimization criterion or goal 
function) and constraints. For all the applications discussed in later chapters, the objective function 
is scalar valued; it takes as arguments a vector x of decision variables, and the data. So our basic 
problem is 


$$\min_{x}\; f(x, \text{data}),$$

subject to constraints. “Minimize” is not restrictive: if we wanted to maximize, we would minimize
$-f$. The model specification will be determined by considering and balancing different aspects:


financial The straightforward part. We define goals and the means to achieve them; and we make 
these notions precise. For an asset allocation model, for instance, we may need to decide 
how to define risk—for example, variability of returns, average losses, probability of failing 
to meet an obligation—or how to measure portfolio turnover. 

empirical In finance, we are rarely interested in the past. (Even though we should be. Reading 
the works of Charles Kindleberger or John Kenneth Galbraith shows why.) All models deal 
with future—and hence unknown—quantities. So we need to forecast, estimate, simulate, or 
approximate these quantities. Building the model (the finance part) must not be separated 
from the empirical part; we may only deal with quantities that we can forecast to a sufficient 
degree. This is a strong, yet unavoidable, constraint on formulating the model. 

computational There is a second constraint: the model must be solvable. Computational aspects 
of such models are the topic of this part of the book. We will argue that they are much less of 
a constraint than is sometimes thought. 

We can think of model building as a meta-optimization in which we try to obtain the best possible 
results (or more realistically, good results; results that improve the current status) for our financial 
goals subject to restrictions that the model remains empirically meaningful and can be solved. 

We will briefly discuss the ingredients of a model here. The objective function is often given by 
the problem at hand. In asset allocation, for instance, we may want a high return and low risk; when 
we calibrate a model, we want to choose parameter values so that the output of the model is “close” 
to observed quantities. Of course, these descriptions need to be made more precise. There are many 
ways to define what “close” means. We have to select, say, a specific norm of absolute or relative 
differences. In fact, many problems can be spelled out in different ways. When we estimate interest 
rate models, we may look at interest rates, but also at bond prices. When we look at options, we 
may work with prices, but also with implied volatilities. Bond prices are functions of interest rates; 
there is a bijective relationship between option prices and Black–Scholes implied volatilities. 
Conceptually, different model formulations may be equivalent; numerically and, especially, empirically 
they are often not. Specific choices can make a difference, and they often do. 

We can phrase this more to the point: can we directly optimize the quantities that we are in- 
terested in? The answer is No; at least, Not Always. A well-known example comes from portfolio 
selection. If we wanted to maximize return, we should not write it without safeguards into our ob- 
jective function. The reason is that (i) we cannot predict future returns very well, and (ii) there is 
a cost involved in failing to correctly predict: the chosen portfolio performs poorly. Theory may 
help, but determining a good objective function—one that serves our financial goals—ultimately is 
an empirical task. 

Most realistic problems have constraints. Constrained optimization is generally more difficult 
than the unconstrained case. Constraints may, like the objective function, be given by the problem 
at hand. In asset allocation we may have legal or regulatory restrictions on how to invest. Empir- 
ically, restrictions often have another purpose. They act as safeguards against optimization results 
that follow more from our specific data, rather than from the data-generating process. This can 
concern out-of-sample performance: in portfolio selection, there is much evidence that imposing 
maximum position sizes helps to improve performance. But constraints can also help to make esti- 
mated parameters interpretable. For example, variances cannot be negative, and probabilities must 
lie in the range [0, 1]. Yet when such quantities are the outputs of a numerical procedure, we are 
not guaranteed that these restrictions are observed, and so we need to make sure that our algorithms 
yield meaningful results. 


10.2 Solving the model 
10.2.1 Problems 


The focus of this book is neither on financial nor empirical aspects of optimization models, but on 
their numerical solution. In the coming chapters we describe a number of optimization models that 
cannot be solved with standard methods, that is, models that pose difficult optimization problems. 
This is not meant presumptuously; it simply reflects the fact that many problems in finance are 
difficult to solve. Heuristic methods which we will describe below can help to obtain good solutions, 
as will be demonstrated in those coming chapters. A clarification is needed here: in optimization 
theory, the solution of a model is the optimum; it is not necessary to speak of “optimal solutions.” 
But that is not the case in practical applications. A solution here is rather the result obtained from a 
computer program. The quality of this solution will depend on the interplay between the problem 
and the chosen method (and chance). 

From a practical perspective, difficulty in solving a problem can be measured by the amount 
of computational resources required to compute a (good) solution. In computer science, there are 
the fields of complexity theory and analysis of algorithms that deal with the difficulty of problems, 
but results are often of limited use in applications. The practical efficiency of algorithms depends 
on their implementation; sometimes minor details can make huge differences. Often the only way 
to obtain results is to run computational experiments, which is exactly what we will do in later 
chapters. 

But what makes an optimization problem difficult? For combinatorial problems, it is the problem 
size. Such problems have an exact solution method—just write down all possible solutions 
and pick the best one—but this approach is almost never feasible for realistic problem sizes. For 
continuous problems, difficulties arise when: 


• The objective function is not smooth (e.g., has discontinuities) or is noisy. In either case, relying 
  on the gradient to determine search directions may fail. An example is a function that needs to 
  be evaluated for given arguments by a stochastic simulation or by another numerical procedure 
  (e.g., a quadrature or a finite-difference scheme). 

• The objective function has many local optima. 


Practically, even in the continuous case we could apply complete enumeration. We could discretize 
the domain of the objective function and run a so-called grid search. But it is easy to see that this 
approach, just like for combinatorial problems, is not feasible in practice once the dimensionality 
of the model grows. 
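The following sketch illustrates why complete enumeration breaks down. It is only an illustration: the function names and the test function are assumptions, not taken from the text. With m grid points per dimension and d dimensions, a grid search needs m^d evaluations of the objective function.

```python
import itertools


def grid_search(f, lower, upper, points_per_dim):
    """Evaluate f on a regular grid; return (best_x, best_value, n_evaluations)."""
    axes = []
    for lo, hi in zip(lower, upper):
        step = (hi - lo) / (points_per_dim - 1)
        axes.append([lo + i * step for i in range(points_per_dim)])
    best_x, best_val, n_eval = None, float("inf"), 0
    for x in itertools.product(*axes):
        val = f(x)
        n_eval += 1
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val, n_eval


def sphere(x):
    # Hypothetical test function: smooth, convex, minimum at the origin.
    return sum(xi * xi for xi in x)


best_x, best_val, n_eval = grid_search(sphere, [-1.0, -1.0], [1.0, 1.0], 11)
print(best_x, best_val, n_eval)

# Evaluations needed at the same grid resolution as the dimension grows
for d in (2, 5, 10, 20):
    print(d, 11 ** d)
```

Even at this coarse resolution of 11 points per dimension, the number of evaluations grows from 121 in two dimensions to more than 10^20 in twenty dimensions.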

Let us look at some stylized examples of objective functions to make these problems clearer. 
First, we picture the nice case. 


[Figure: a stylized objective function plotted against x.] 


We only have one optimum, the function is smooth, and it is convex. But now imagine we have 
functions like this. 


[Figure, two panels: stylized objective functions plotted against x.] 


The left figure shows a kink in the objective function, so we cannot use an analytical derivative. The 
figure on the right has an inflection point, so the objective function is not convex. Newton’s method 
for instance relies on f being globally convex as it will approximate f by a quadratic function, and 
then solve for the minimum of this function (see Fig. 11.9 on page 249). Already such apparently 
innocuous properties may cause trouble for standard optimization methods. 
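A small numerical experiment can make the point about Newton's method concrete. The sketch below is an illustration under assumptions: the cubic test function is made up and is not one of the functions pictured above. Newton's method for optimization looks for a point where the derivative is zero, so in a region where the second derivative is negative it can converge to a maximum instead of a minimum.

```python
def newton_1d(df, d2f, x0, tol=1e-12, max_iter=100):
    """Newton iteration x <- x - f'(x)/f''(x); finds a stationary point of f."""
    x = x0
    for _ in range(max_iter):
        step = df(x) / d2f(x)
        x -= step
        if abs(step) < tol:
            break
    return x


# Hypothetical example: f(x) = x**3 - 3*x has an inflection point at x = 0,
# a local minimum at x = 1 and a local maximum at x = -1.
def df(x):
    return 3 * x**2 - 3   # first derivative


def d2f(x):
    return 6 * x          # second derivative


x_from_right = newton_1d(df, d2f, x0=0.5)    # start where f'' > 0
x_from_left = newton_1d(df, d2f, x0=-0.5)    # start where f'' < 0
print(x_from_right, x_from_left)
```

Started to the right of the inflection point, the iteration settles at the local minimum; started to the left, it settles at the local maximum.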

The functions pictured so far had one minimum, but what if the function looks like this: 


[Figure: objective function f plotted against x.] 


We have two local minima, and one is clearly better than the other. Yet traditional methods will 
easily get trapped in such local (but suboptimal) minima. 
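The effect can be reproduced with a plain gradient descent. This is a minimal sketch: the objective function, step size, and starting points are assumptions chosen for illustration. Which minimum is found depends only on where the search starts.

```python
def f(x):
    # Hypothetical objective with two local minima (near x = -1.04 and x = 0.96).
    return (x * x - 1.0) ** 2 + 0.3 * x


def grad(x):
    # Analytical derivative of f
    return 4.0 * x * (x * x - 1.0) + 0.3


def gradient_descent(x0, step=0.01, n_iter=5000):
    """Fixed-step gradient descent; it only ever moves downhill locally."""
    x = x0
    for _ in range(n_iter):
        x -= step * grad(x)
    return x


x_a = gradient_descent(2.0)    # starts to the right of both minima
x_b = gradient_descent(-2.0)   # starts to the left of both minima
print(x_a, f(x_a))
print(x_b, f(x_b))
```

The run started at 2.0 stops in the right-hand minimum even though the left-hand one has a lower objective function value.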

As a final example, we may not be able to evaluate our objective function precisely, but only 
subject to noise. 


[Figure, two panels: noisy objective functions plotted against x.] 


These functions were stylized, but these problems really occur. Several real examples are presented 
below. 


10.2.2 Classical methods and heuristics 


To solve optimization models, many researchers and practitioners rely on what we call here stan- 
dard or classical techniques. (You may have noticed that we have referred to “standard” methods 
several times before.) Classical methods are, for the purpose of this book, defined as methods that 
require convexity or at least well behaved objective functions since these techniques are often based 
on exploiting the derivatives of the objective function. Classical methods are mathematically well 
founded; numerically, there are powerful solvers available which can efficiently solve even large- 
scale instances of given problems. Methods that belong to this approach are, for instance, linear 
and quadratic programming. The efficiency and elegance of these methods comes at a cost, though, 
since considerable constraints are put on the problem formulation, that is, the functional form of 
the optimization criterion and the constraints. We often have to shape the problem such that it can 
be solved by these methods. Thus, the answer that the final model provides is a precise one, but 
often only to an approximative question. 

An alternative approach that we will describe in this book is the use of heuristic optimization 
techniques. Heuristics are a relatively new development in optimization theory. Even though early 
examples date back to the 1950s or so, these methods have become practically relevant only in re- 
cent decades with the enormous growth in computing power. Heuristics aim at providing good and 
fast approximations to optimal solutions; the underlying theme of heuristics may thus be described 
as seeking approximative answers to exact questions. Heuristics have been shown to work well for 
problems that are completely infeasible for classical approaches (Michalewicz and Fogel, 2004). 
Conceptually, they are often simple; implementing them rarely requires high levels of mathematical 
sophistication or programming skills. Heuristics are flexible: we can easily add, remove, or change 
constraints, or modify the objective function. These advantages come at a cost as well, as the ob- 
tained solution is only a stochastic approximation, a random variable. However, such a stochastic 
solution may still be better than a poor deterministic one (which, even worse, we may not even 
recognize as such) or no solution at all when classical methods cannot be applied. In fact, for many 
practical purposes, the goal of optimization is far more modest than to find the truly best solution. 
Rather, any good solution, where good means an improvement of the status quo, is appreciated. 
And practically, we often get very good solutions. We give examples in later chapters. 
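To illustrate what it means for a solution to be a random variable, here is a deliberately simple stochastic search. It is a sketch under assumptions: a greedy random local search with restarts on a made-up objective function, not one of the heuristics presented in later chapters. Each seed produces a slightly different solution, yet all of them are good.

```python
import random


def f(x):
    # Hypothetical objective with a local minimum near x = 0.96
    # and a better (global) minimum near x = -1.04.
    return (x * x - 1.0) ** 2 + 0.3 * x


def random_restart_search(seed, n_restarts=20, n_steps=500, step_sd=0.1):
    """Greedy random local search with restarts; returns (best_x, best_f)."""
    rng = random.Random(seed)
    best_x, best_f = None, float("inf")
    for _ in range(n_restarts):
        x = rng.uniform(-2.0, 2.0)
        fx = f(x)
        for _ in range(n_steps):
            x_new = x + rng.gauss(0.0, step_sd)
            f_new = f(x_new)
            if f_new < fx:          # accept improvements only
                x, fx = x_new, f_new
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f


# One run per seed: the reported solution varies from run to run.
results = [random_restart_search(seed) for seed in range(5)]
for x, fx in results:
    print(round(x, 4), round(fx, 6))
```

Changing constraints or the objective function only requires changing `f`; nothing in the search relies on derivatives or convexity.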

Heuristics are not better optimization techniques than classical methods; the question is rather 
when to use what kind of method. If classical techniques can be applied, heuristic methods will 
practically always be less efficient. When, however, given problems do not fulfill the requirements 
of classical methods (and the number of such problems seems large), we suggest not to tailor the 
problem to the available optimization technique, but to choose an alternative—heuristic—technique 
for optimization. 

We will discuss classical methods in Chapter 11. In Chapter 12, we then give a selective 
overview of heuristic techniques; the remaining chapters of Part III detail the implementation of 
these methods. 


10.3 Evaluating solutions 


Modeling is approximation. As described in Chapter 1, whenever we approximate we commit errors. 
Hence, a solution to our model will inherit these errors. In Chapter 1, we suggested dividing 
errors into two broad categories: model errors (i.e., empirical errors), and numerical errors. This 
book is mainly about the second type. In finance, however, the first type is much more important. 
And evaluating it is much more difficult. A practical approach is to compare different models of 
which some are “more accurate” than others. Accurate means that a model can not only be solved 
with sufficient numerical precision, but that it is also economically meaningful. When a model is 
more accurate than another model, it has a lower empirical error, at least regarding some aspect of 
the original problem. Suppose we have two models that serve the same purpose. One model can be 
solved precisely, but is less accurate than a second model that can be only solved approximatively. 
We still can empirically test whether an only moderately good solution to the more accurate model 
provides a better answer to our real problem than the precise solution to the less accurate model. 
Again, an example from portfolio selection can illustrate this point. Markowitz (1959, Chapter 9) 
compares two risk measures, variance and semi-variance, in terms of cost, convenience, familiarity, 
and desirability; he concludes that variance is superior in terms of cost, convenience, and 
familiarity. For variance we can compute exact solutions to the portfolio selection problem; for 
semi-variance we can only approximate the solution. Today, we can empirically test whether, even 
with an inexact solution for semi-variance, the gains in desirability outweigh the increased effort. 

Even if we accept a model as true, the quality of the model’s solution will be limited by the 
quality of the model’s inputs, that is, data or parameter estimates. Appreciating these limits helps 
to decide how precise a solution we actually need. This decision is relevant for many problems in 
computational finance since we generally face a trade-off between the precision of a solution and 
the effort required (most visibly, computing time). Surely, the numerical precision with which we 
solve a model is important; we need reliable methods. Yet, empirically, there must be a sufficient 
precision for any given problem. Any improvement beyond this level cannot translate into gains 
regarding the actual problem any more; only in costs (increased computing time or development 
costs). Given the rather low empirical accuracy of many financial models, this required precision 
cannot be high. More specifically, when it comes to optimization we can decide whether we actually 
need an exact solution as promised by the application of classical methods, or whether a good 
solution provided by a heuristic is enough. 

In principle, of course, there would seem to be little cost in computing precise solutions. Yet 
there is. First, highly precise or “exact” solutions will give us a sense of certainty that can never be 
justified by a model. Second, getting more-precise solutions will require more resources. This does 
not just mean more computing time, but also more time to develop a particular method. Grinold 
and Kahn (2008, pp. 284-285) give an example; they describe the implementation of algorithms 
for an asset selection problem—finding the portfolio with cardinality 50 that best tracks the S&P 
500. This is a combinatorial problem for which an exact solution could never be computed (there 
are about $10^{69}$ possible portfolios), hence we need an approximation. A specialized algorithm took 
six months to be developed, but then delivered an approximative solution within seconds. As an 
alternative, a heuristic technique, a Genetic Algorithm, was tested. Implementation took two days; 
the algorithm found similar results, but also needed two days of computing time. A remark is in 
order: the example is from the 1990s. Today, the computing time of a Genetic Algorithm for such 
a problem would be of the order of minutes, perhaps seconds, on a standard PC. Researchers may 
have become cleverer since the 1990s, i.e. they may be developing models faster today; but it is 
unlikely that their improvement in performance matches that of computer technology. 
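The size of the search space in this example can be checked directly by counting the ways to choose 50 assets out of 500. The evaluation speed used below is a made-up assumption, included only to give a sense of scale.

```python
import math

# Number of ways to choose 50 assets out of 500 (weights ignored).
n_portfolios = math.comb(500, 50)
print(f"{n_portfolios:.3e}")

# Hypothetical throughput: one billion portfolios evaluated per second.
seconds = n_portfolios / 1e9
years = seconds / (365.25 * 24 * 3600)
print(f"{years:.3e} years for complete enumeration")
```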

A final consideration of the quality of a solution is the distinction between in-sample and out-of- 
sample. For financial models, this distinction is far more relevant than any numerical issue. More 
precise solutions will by definition appear better in-sample, but we need to test if this superior- 
ity is preserved out-of-sample. Given the difficulty we have in predicting or estimating financial 
quantities, it seems unlikely that highly precise solutions are necessary in financial models once 
we compare out-of-sample results. Again, we can set up an empirical test here. Suppose we have 
a model for which we can compute solutions with varying degrees of precision. Each solution can 
be evaluated by its in-sample objective function value (its in-sample fit), but each solution also 
corresponds to a certain out-of-sample quality. We can now sort our solutions by in-sample precision, 
and then see if out-of-sample quality is a roughly monotonic function of this precision, and if the 
slope of this function is economically meaningful. 

To sum up this discussion: the numerical precision with which we solve a model matters, and 
in-sample tests can show whether our optimization routines work properly. The more relevant ques- 
tion though is to what extent this precision translates into economically meaningful solutions to our 
actual problems. 

In the remaining part of this chapter, several examples for optimization problems that cannot 
be solved with classical methods will be given. This is not a survey, but a selection of problems 
to illustrate and motivate the use of heuristic methods in finance. Some of these problems will 
be revisited in later chapters. For more detailed studies, see for example Maringer (2005b). Many 
references to specific applications can be found in Schlottmann and Seese (2004) or Gilli et al. 
(2008). 


10.4 Examples 
Portfolio optimization with alternative risk measures 


In the framework of modern portfolio optimization (Markowitz, 1952, 1959), a portfolio of as- 
sets is often characterized by a desired property, the reward, and something undesirable, the risk. 
Markowitz identified these two properties with the expectation and the variance of returns, re- 
spectively, hence the expression mean-variance optimization. By now, there exists a large body of 
evidence that financial asset returns are not normally distributed (see, for instance, Cont, 2001), 
thus, describing a portfolio by only its first two moments is often regarded as insufficient. Alter- 
natives to mean-variance optimization have been proposed, in particular, replacing variance as the 
risk measure. 

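Assume an investor has wealth $v_0$ and wishes to invest for a fixed period of time. A given 
portfolio, as it comprises risky assets, maps into a distribution of wealth $v_T$ at the end of the period. 
The optimization problem can be stated as follows: 

$$
\begin{aligned}
\min_{x} \quad & f\bigl(v_T(x)\bigr) \\
& x_j^{\mathrm{inf}} \le x_j \le x_j^{\mathrm{sup}}, \qquad j \in J \\
& K_{\mathrm{inf}} \le \#\{J\} \le K_{\mathrm{sup}} \\
& \ldots \text{ and other constraints.}
\end{aligned}
$$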


The objective function $f(\cdot)$ could be a risk measure or a combination of multiple objectives to be 
minimized. Candidates include the portfolio’s drawdown, partial moments, or whatever else we 
wish to optimize. The vector $x$ stores the weights or the units (integer numbers) of assets held. 
$x^{\mathrm{inf}}$ and $x^{\mathrm{sup}}$ are vectors of minimum and maximum holding sizes, respectively, for those assets 
included in the portfolio (i.e., those in the set $J$). If short sales are allowed, this constraint could 
be modified to $x_j^{\mathrm{inf}} \le |x_j| \le x_j^{\mathrm{sup}}$. $K_{\mathrm{inf}}$ and $K_{\mathrm{sup}}$ are cardinality constraints which set a minimum 
and maximum number of assets to be in $J$. There may be restrictions on transaction costs (in 
any functional form) or turnover, lot size constraints (i.e., restrictions on the multiples of assets 
that can be traded), and exposure limits. We may also add constraints that, under certain market 
conditions, the portfolio needs to behave in a certain way (usually give a required minimum return). 
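The holding-size and cardinality constraints are easy to check for a given candidate portfolio, even though they make the optimization problem hard. The following sketch shows one possible check for long-only weights. The function name, the tolerance, and the example limits are assumptions made for illustration.

```python
def is_feasible(x, x_inf, x_sup, k_inf, k_sup, tol=1e-12):
    """Check holding-size and cardinality constraints for a weight vector x.

    An asset counts as included (it is in the set J) if its weight is nonzero.
    Holding-size limits apply only to included assets.
    """
    included = [j for j, xj in enumerate(x) if abs(xj) > tol]
    if not (k_inf <= len(included) <= k_sup):
        return False
    return all(x_inf[j] <= x[j] <= x_sup[j] for j in included)


# Made-up limits: each included asset between 10% and 60%, two to three assets.
x_inf = [0.1] * 4
x_sup = [0.6] * 4
print(is_feasible([0.5, 0.5, 0.0, 0.0], x_inf, x_sup, 2, 3))      # within all limits
print(is_feasible([0.95, 0.05, 0.0, 0.0], x_inf, x_sup, 2, 3))    # violates size limits
print(is_feasible([0.25, 0.25, 0.25, 0.25], x_inf, x_sup, 2, 3))  # too many assets
```

Because a weight is either zero or at least the minimum holding size, the feasible set is not connected, which is one reason why such constraints are awkward for classical methods.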

Similar to this framework are index-tracking problems. Here investors try to replicate a predefined 
benchmark; see, for example, Gilli and Këllezi (2002). This benchmark need not be a passive 
equity index. In the last few years, for instance, there have been attempts to replicate the returns of 
hedge funds; see Lo (2008). 

Applying alternative risk measures generally necessitates using the empirical distribution of 
returns. (There is little advantage in minimizing kurtosis when stock returns are modeled by a 
Brownian motion.) The resulting optimization problem cannot be solved with classical methods 
(except for special cases like mean-variance optimization). To give an example, Fig. 10.1 shows 
the search space, that is, the values of the objective function that particular solutions map into, for 
a problem where f is the portfolio’s Value-at-Risk. The resulting surface is not convex and not 
smooth. Any search that requires a globally convex model, like a gradient-based method, will stop 
at the first local minimum encountered, if it arrives there at all. 
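A sketch of such an objective function shows where the lack of smoothness comes from. The code is an illustration under assumptions: the scenario returns are made up, and the quantile is taken as a simple order statistic, which is one convention among several. Value-at-Risk computed from an empirical distribution depends on a single ordered scenario, and which scenario that is changes as the weights change.

```python
import math


def portfolio_returns(weights, scenarios):
    """Portfolio return in each scenario (one row of asset returns per scenario)."""
    return [sum(w * r for w, r in zip(weights, row)) for row in scenarios]


def empirical_var(weights, scenarios, alpha):
    """Value-at-Risk as the negative empirical alpha-quantile of portfolio returns."""
    returns = sorted(portfolio_returns(weights, scenarios))
    k = max(math.ceil(alpha * len(returns)) - 1, 0)   # index of the order statistic
    return -returns[k]


# Made-up scenario returns for two assets (not real data).
scenarios = [[-0.10, 0.02], [0.04, -0.06], [0.01, 0.03], [0.05, 0.01]]
var_mixed = empirical_var([0.5, 0.5], scenarios, alpha=0.25)
var_single = empirical_var([1.0, 0.0], scenarios, alpha=0.25)
print(var_mixed, var_single)
```

In a realistic application there would be hundreds or thousands of scenarios and many assets; the structure of the computation stays the same.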

For some objective functions, the optimization problem can be reformulated to be solved with 
classical methods; examples are Gaivoronski and Pflug (2005), Rockafellar and Uryasev (2000), 
and Chekhlov et al. (2005). But such solutions are problem specific and do not accommodate changes 
in the model formulation. How to use heuristics for portfolio selection will be discussed more 
thoroughly in Chapter 14. 

FIGURE 10.1 Objective function for Value-at-Risk. 


Model selection 


Linear regression is a widely used technique in finance. A common application is in factor models 
where the returns of single assets are described as functions of other variables. Then 

$$
r = \begin{bmatrix} F_1 & \cdots & F_k \end{bmatrix}
\begin{bmatrix} \beta_1 \\ \vdots \\ \beta_k \end{bmatrix} + \epsilon \tag{10.1}
$$

with $r$ being a vector of returns for a given asset, $F_i$ the vectors of factor realizations, $\beta$ 
the factor loadings, and $\epsilon$ capturing the remaining variation. Such models are widely applied for 
instance to construct variance—covariance matrices or in attempts to forecast future returns. The 
factors $F$ may be macroeconomic quantities or firm-specific characteristics; alternatively, the analyst 
may use statistical factors, for instance extracted by principal component analysis. In practice, 
observable factors are often preferred since they are easier to interpret and can be better explained 
to clients. Given the vast amounts of financial data available, these factors may have to be picked 
from hundreds or thousands of available variables, since we may also consider lagged variables 
(Maringer, 2004). Model selection becomes a critical issue, as we often wish to use only a small 
number of $k$ regressors from $K$ possible ones, with $K \gg k$. We could use an information criterion, 
which penalizes additional regressors, as the objective function. Alternatively, techniques like 
cross-validation can be applied or the problem can be formulated as an in-sample fit maximization 
under the restriction that k is not greater than a (small) fixed number. 
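As an illustration of an information criterion used as the objective function, the sketch below uses the Schwarz (Bayesian) criterion for a linear model with Gaussian errors. This particular criterion is an assumption, since the text does not name one, and the numbers are made up. The last line shows how large the search space becomes even for modest values of k and K.

```python
import math


def bic(rss, n_obs, n_regressors):
    """Schwarz criterion up to an additive constant: n*log(RSS/n) + k*log(n).

    Smaller values are better; each extra regressor costs log(n).
    """
    return n_obs * math.log(rss / n_obs) + n_regressors * math.log(n_obs)


# Made-up numbers: a fourth regressor lowers the residual sum of squares only slightly.
small_model = bic(rss=10.0, n_obs=100, n_regressors=3)
large_model = bic(rss=9.9, n_obs=100, n_regressors=4)
print(small_model, large_model)

# Number of candidate models when choosing k = 5 regressors out of K = 500.
n_models = math.comb(500, 5)
print(n_models)
```

Here the small improvement in fit does not compensate for the penalty, so the smaller model has the lower criterion value.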


Robust/resistant regression 


Empirical evidence has shown that the Capital Asset Pricing Model (CAPM) explains asset returns 
in the cross-section rather poorly (Fama and French, 1993, 2004). When we re-interpret the 
CAPM as a one-factor model (Luenberger, 1998, Chapter 8), however, the $\beta$ estimates become 
useful measures of a stock’s general correlation with the market, which may be used to construct 
variance–covariance matrices (Chan et al., 1999). 

The standard method to obtain parameter estimates in a linear regression is Least Squares. Least 
Squares has appealing theoretical and practical (numerical) properties, but obtained estimates are 
often unstable in the presence of extreme observations which are common in financial time series 
(Chan and Lakonishok, 1992; Knez and Ready, 1997; Genton and Ronchetti, 2008). Some earlier 
contributions in the finance literature suggested some form of shrinkage of extreme $\beta$ estimates 
towards more reasonable levels, with different theoretical justifications (see for example Blume, 
1971, or Vasicek, 1973). Alternatively, the use of robust or resistant estimation methods to obtain 
the regression parameters has been proposed (Chan and Lakonishok, 1992; Martin and Simin, 
2003). Among possible regression criteria, high breakdown point estimators are often regarded as 
desirable. The breakdown point of an estimator is the smallest percentage of outliers that may cause 
the estimator to be affected by a bias. The Least Median of Squares (LMS) estimator, suggested 
by Rousseeuw (1984), ranks highly in this regard, as its breakdown point is 50%. (Note that Least 
Squares may equivalently be called Least Mean of Squares.) 

FIGURE 10.2 LMS objective function. 

Unfortunately, LMS regression leads to nonconvex optimization models; a particular search 
space for the simple model $y = \beta_1 + \beta_2 x + \epsilon$ is shown in Fig. 10.2. Estimation will be discussed 
in Chapter 16. 
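The two regression criteria differ only in how the squared residuals are aggregated, which the following sketch makes explicit. The data are made up: five points on a line, with the last observation replaced by an outlier.

```python
import statistics


def squared_residuals(beta1, beta2, xs, ys):
    return [(y - beta1 - beta2 * x) ** 2 for x, y in zip(xs, ys)]


def lms_objective(beta1, beta2, xs, ys):
    """Least Median of Squares: median of the squared residuals."""
    return statistics.median(squared_residuals(beta1, beta2, xs, ys))


def ls_objective(beta1, beta2, xs, ys):
    """Least (Mean of) Squares: mean of the squared residuals."""
    return statistics.fmean(squared_residuals(beta1, beta2, xs, ys))


# Made-up data on the line y = 1 + 2x; the last observation is an outlier.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 100.0]
lms_value = lms_objective(1.0, 2.0, xs, ys)   # evaluated at the true line
ls_value = ls_objective(1.0, 2.0, xs, ys)
print(lms_value, ls_value)
```

At the true parameters the median criterion ignores the outlier, while the mean criterion is dominated by it. The price of this robustness is that the median is not a smooth function of the parameters.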


Agent-based models 


Agent-based models (ABM) abandon the attempt to model markets and financial decisions with 
one representative agent (Kirman, 1992). This results in models that quickly become analytically 
intractable, hence researchers rely on computer simulations to obtain results. ABM are capable 
of producing many of the “stylized facts” actually observed in financial markets like volatility 
clustering, jumps or fat tails. For overviews on ABM in finance, see, for example, LeBaron (2000, 
2006); see also Section 8.8. 

Unfortunately, the conclusion of many studies stops at asserting that these models can in prin- 
ciple produce realistic market behavior when parameters (like preferences of agents) are specified 
appropriately. This leads to the question of what appropriate values should be like, and how differ- 
ent models compare with one another when it comes to explaining market facts. 

Gilli and Winker (2003) suggest estimating the parameters of such models by indirect infer- 
ence. This requires an auxiliary model that can easily be estimated, which in their case is simply a 
combination of several moments of the actual price data. A given set of parameters for the ABM is 
evaluated by measuring the distance between the average realized moments of the simulated series 
and the moments obtained from real data. This distance is then to be minimized by adjusting the pa- 
rameters of the ABM. Winker et al. (2007) provide a more detailed analysis of objective functions 
for such problems. 
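The structure of such an objective function can be sketched as follows. This is an illustration under strong assumptions: the simulator below just draws Gaussian returns and is a stand-in, not an agent-based model, and the two moments and the target values are made up. The sketch shows that evaluating the objective function twice at the same parameter value gives two different results.

```python
import math
import random
import statistics


def simulate_returns(volatility, n_obs, rng):
    # Stand-in simulator with Gaussian returns; NOT an agent-based model.
    return [rng.gauss(0.0, volatility) for _ in range(n_obs)]


def moments(returns):
    """Auxiliary statistics: variance and mean absolute return."""
    return (
        statistics.pvariance(returns),
        statistics.fmean([abs(r) for r in returns]),
    )


def distance(volatility, target, rng, n_rep=20, n_obs=500):
    """Squared distance between average simulated moments and target moments."""
    sims = [moments(simulate_returns(volatility, n_obs, rng)) for _ in range(n_rep)]
    avg = [statistics.fmean([m[i] for m in sims]) for i in range(len(target))]
    return sum((a - t) ** 2 for a, t in zip(avg, target))


# Hypothetical "observed" moments, here those implied by a volatility of 0.01.
target = (0.01**2, 0.01 * math.sqrt(2.0 / math.pi))

d1 = distance(0.02, target, random.Random(1))
d2 = distance(0.02, target, random.Random(2))   # same parameter, other random draws
d3 = distance(0.01, target, random.Random(3))   # parameter that matches the target
print(d1, d2, d3)
```

In practice the moments would usually be scaled or weighted so that no single one dominates the distance.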

Fig. 10.3 shows the resulting search space for a particular ABM (see Kirman, 1993). The objective 
function does not seem too irregular at all, but since the function was evaluated by a stochastic 
simulation of the model, it is noisy and does not allow for the application of classical methods. 

FIGURE 10.3 Simulated objective function for Kirman’s model for two parameters. 


Calibration of option-pricing models 


Prices of options and other derivatives are modeled as functions of the underlying securities’ 
characteristics (Madan, 2001). Parameter values for such models are often obtained by solving inverse 
problems, that is, we try to obtain parameter values for which the model gives prices that are close 
to actual market prices. In the case of the Black–Scholes model, only one parameter, volatility, needs 
to be specified, which can be done efficiently with Newton’s method (Manaster and Koehler, 1982). 
More recent option-pricing models (see, for instance, Bakshi et al., 1997, or Bates, 2003) aim to 
generate prices that are consistent with the empirically observed implied volatility surface (Cont 
and da Fonseca, 2002). Calibrating these models requires setting more parameters, which leads to 
more difficult optimization problems. 
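For the one-parameter Black–Scholes case, the inverse problem can be sketched in a few lines. This is a minimal illustration for a European call without dividends. The starting value of 0.3 and the stopping rule are arbitrary assumptions and not the procedure proposed by Manaster and Koehler, and the "market" price is synthetic.

```python
import math


def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def d1_term(S, K, T, r, sigma):
    return (math.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))


def bs_call(S, K, T, r, sigma):
    """Black-Scholes price of a European call (no dividends)."""
    d1 = d1_term(S, K, T, r, sigma)
    d2 = d1 - sigma * math.sqrt(T)
    return S * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)


def bs_vega(S, K, T, r, sigma):
    """Derivative of the call price with respect to volatility."""
    return S * norm_pdf(d1_term(S, K, T, r, sigma)) * math.sqrt(T)


def implied_vol(price, S, K, T, r, sigma0=0.3, tol=1e-12, max_iter=50):
    """Newton's method on sigma -> bs_call(sigma) - price."""
    sigma = sigma0
    for _ in range(max_iter):
        diff = bs_call(S, K, T, r, sigma) - price
        if abs(diff) < tol:
            break
        sigma -= diff / bs_vega(S, K, T, r, sigma)
    return sigma


market_price = bs_call(100.0, 100.0, 1.0, 0.02, 0.2)   # synthetic "market" quote
sigma_implied = implied_vol(market_price, 100.0, 100.0, 1.0, 0.02)
print(sigma_implied)
```

For options far from the money, vega becomes very small and a plain Newton step can overshoot, so a production routine would need safeguards that are omitted here.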

One particular pricing model is that of Heston (1993); it is popular since it gives closed-form 
solutions. “Closed-form” is somewhat deceiving: pricing still requires numerical integration of a 
complex-valued function. Under the Heston model, the stock price (S) and variance (v) dynamics 
are described by 


$$
\begin{aligned}
dS_t &= r S_t \, dt + \sqrt{v_t} \, S_t \, dW_t^{1} \\
dv_t &= \kappa(\theta - v_t) \, dt + \sigma \sqrt{v_t} \, dW_t^{2}
\end{aligned}
$$


where the two Brownian motion processes are correlated, that is, $dW_t^{1} \, dW_t^{2} = \rho \, dt$. As can be seen 
from the second equation, the variance is mean-reverting to its long-run level $\theta$ with speed $\kappa$ in the 
Heston model. In total, the model requires (under the risk-neutral measure) the specification of five 
parameters (Mikhailov and Nögel, 2003). Even though some of these parameters could be estimated 
from the time series of the underlying, the general approach to fit the model is to minimize the 
differences between the theoretical and observed prices. A possible objective function is, hence, 


$$\min \sum_{n=1}^{N} w_n \left( C_n^{H} - C_n^{M} \right)^2$$


where $N$ is the number of option quotes available, $C^{H}$ and $C^{M}$ are the theoretical and actual option 
prices, respectively, and $w$ are weights (Hamida and Cont, 2005). Sometimes the optimization 
model also includes parameter restrictions, for example to enforce the parameters to be such that 
the volatility cannot become negative. 
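In code, such an objective function is a thin wrapper around a pricing routine. The sketch below is generic and rests on assumptions: `model_price` stands for whatever pricer is being calibrated, and the one-parameter toy pricer used in the example is hypothetical and has nothing to do with the Heston model.

```python
def calibration_objective(params, quotes, model_price, weights=None):
    """Weighted sum of squared differences between model and market prices.

    quotes: list of (contract, market_price) pairs.
    model_price(params, contract): user-supplied pricing function.
    """
    if weights is None:
        weights = [1.0] * len(quotes)
    return sum(
        w * (model_price(params, contract) - market) ** 2
        for w, (contract, market) in zip(weights, quotes)
    )


def toy_price(params, contract):
    # Hypothetical pricer: price is proportional to a contract characteristic.
    return params[0] * contract


# Made-up quotes: (contract characteristic, observed price)
quotes = [(1.0, 2.0), (2.0, 4.1)]
obj_equal = calibration_objective((2.0,), quotes, toy_price)
obj_weighted = calibration_objective((2.0,), quotes, toy_price, weights=[1.0, 0.5])
print(obj_equal, obj_weighted)
```

Keeping the pricer as an argument means that the same objective function can be handed to any optimizer, classical or heuristic.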

Fig. 10.4 shows the resulting objective function values for two parameters (volatility of volatility 
and mean-reversion speed) with the remaining parameters fixed. As can be seen, in certain parts 
of the parameter domain the resulting objective function is not too well behaved, hence standard 
methods may not find the global minimum. The Heston model is discussed in Chapter 17. 


Calibration of yield structure models 


The model of Nelson and Siegel (1987) and its extended version, introduced by Svensson (1994), 
are widely used to approximate the term structure of interest rates. Many central banks use the 
models to represent the spot and forward rates as functions of time to maturity; in several studies 
(e.g., Diebold and Li, 2006) the models have also been used for forecasting interest rates. 

FIGURE 10.4 Left panel: Heston model objective function. Right panel: Nelson–Siegel–Svensson model objective function. 


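Let $y_t(\tau)$ be the yield of a zero-coupon bond with maturity $\tau$ at time $t$; then the Nelson–Siegel 
model describes the zero rates as 

$$
y_t(\tau) = \beta_{1,t} + \beta_{2,t} \left[ \frac{1 - \exp(-\gamma_t)}{\gamma_t} \right]
+ \beta_{3,t} \left[ \frac{1 - \exp(-\gamma_t)}{\gamma_t} - \exp(-\gamma_t) \right] \tag{10.2}
$$

where $\gamma_t = \tau / \lambda_t$. The Svensson version is given by 

$$
\begin{aligned}
y_t(\tau) = {} & \beta_{1,t} + \beta_{2,t} \left[ \frac{1 - \exp(-\gamma_{1,t})}{\gamma_{1,t}} \right]
+ \beta_{3,t} \left[ \frac{1 - \exp(-\gamma_{1,t})}{\gamma_{1,t}} - \exp(-\gamma_{1,t}) \right] \\
& + \beta_{4,t} \left[ \frac{1 - \exp(-\gamma_{2,t})}{\gamma_{2,t}} - \exp(-\gamma_{2,t}) \right]
\end{aligned} \tag{10.3}
$$

where $\gamma_{1,t} = \tau / \lambda_{1,t}$ and $\gamma_{2,t} = \tau / \lambda_{2,t}$. The parameters of the models ($\beta$ and $\lambda$) can be estimated by 
minimizing the difference between the model rates $y_t$ and observed rates $y_t^{M}$ where the superscript 
stands for “market.” An optimization problem could be stated as 

$$\min_{\beta, \lambda} \sum \left( y_t - y_t^{M} \right)^2$$

subject to constraints. We need to estimate four parameters for model (10.2), and six for 
model (10.3). Again, this optimization problem is not convex; an example for a search space is 
given in Fig. 10.4. The calibration of the Nelson–Siegel model and Svensson’s extension is discussed 
in more detail in Chapter 16. 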
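The Nelson–Siegel function and the least-squares objective can be written down directly. The sketch below is an illustration under assumptions: the parameter values and maturities are made up, and the "market" yields are generated from the model itself so that the true parameters are known.

```python
import math


def nelson_siegel(tau, beta1, beta2, beta3, lam):
    """Nelson-Siegel zero rate for time to maturity tau > 0."""
    gamma = tau / lam
    hump = -math.expm1(-gamma) / gamma          # (1 - exp(-gamma)) / gamma
    return beta1 + beta2 * hump + beta3 * (hump - math.exp(-gamma))


def sum_squared_errors(params, maturities, market_yields):
    """Objective function: squared differences between model and market yields."""
    return sum(
        (nelson_siegel(tau, *params) - y) ** 2
        for tau, y in zip(maturities, market_yields)
    )


# Synthetic "market" yields (in percent) from known, made-up parameters.
true_params = (4.0, -2.0, 1.0, 1.5)
maturities = [0.25, 0.5, 1, 2, 5, 10, 30]
market = [nelson_siegel(tau, *true_params) for tau in maturities]

sse_true = sum_squared_errors(true_params, maturities, market)
sse_other = sum_squared_errors((4.0, -2.0, 1.0, 3.0), maturities, market)
print(sse_true, sse_other)
```

The function also shows how the parameters can be read: for very long maturities the yield approaches `beta1`, and for very short maturities it approaches `beta1 + beta2`.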


10.5 Summary 


In this chapter, we have tried to motivate the application of optimization models in finance. The 
coming chapters will demonstrate how to solve such models. In Chapter 11, we will discuss classical 
methods: methods for zero-finding and gradient-based methods. In Chapter 12, we will give a 
brief introduction to heuristic methods. The emphasis of that chapter will be on principles, not on 
details. Later chapters will then discuss the process of solving particular problems with heuristics. 


Chapter 11 


Basic methods 


Contents 

11.1 Finding the roots of f(x) = 0 
    11.1.1 A naive approach 
        Graphical solution 
        Random search 
    11.1.2 Bracketing 
    11.1.3 Bisection 
    11.1.4 Fixed point method 
        Convergence 
    11.1.5 Newton's method 
        Comments 
11.2 Classical unconstrained optimization 
        Convergence 
11.3 Unconstrained optimization in one dimension 
    11.3.1 Newton's method 
    11.3.2 Golden section search 
11.4 Unconstrained optimization in multiple dimensions 
    11.4.1 Steepest descent method 
    11.4.2 Newton's method 
    11.4.3 Quasi-Newton method 
    11.4.4 Direct search methods 
    11.4.5 Practical issues with MATLAB 
11.5 Nonlinear Least Squares 
    11.5.1 Problem statement and notation 
    11.5.2 Gauss-Newton method 
    11.5.3 Levenberg-Marquardt method 
11.6 Solving systems of nonlinear equations F(x) = 0 
    11.6.1 General considerations 
    11.6.2 Fixed point methods 
    11.6.3 Newton's method 
    11.6.4 Quasi-Newton methods 
    11.6.5 Further approaches 
11.7 Synoptic view of solution methods 

This chapter is about classic methods for unconstrained optimization, including the special case 
of nonlinear Least Squares. Optimization is also related to finding the zeros of a function; thus, 
the solution of nonlinear systems of equations is also part of this chapter. A variety of approaches 
is considered, depending on whether we are in a one-dimensional setting or we solve problems 
in higher dimensions. To enhance the clarity of the presentation, Fig. 11.1 shows a diagram that 
might help to structure the problems discussed in this chapter. Among the general unconstrained 
optimization problems, we distinguish the one-dimensional case (1-D) and the n-dimensional case 
(n-D), and within each category, we distinguish between gradient-based and direct search methods. 
In Fig. 11.1, the solution of a linear system Ax = b, presented in Chapter 3, is connected by dotted 
arrows to the methods where it constitutes a building block. 

In order to gain insight and understanding of the workings of the methods presented, most 
algorithms have been coded and executed to solve illustrative examples. These codes are not meant 
to be state of the art. We indicate the corresponding functions provided in MATLAB® and briefly 
illustrate their use by solving some of the problems presented in the illustrations. 


11.1 Finding the roots of f(x) =0 


11.1.1 A naive approach 


The problem consists in finding the value of x satisfying the equation 
f(x) =0. 


The solution is called a zero of the function. Several solutions may exist. 
An equation like $1 - e^{-x} = 0.5$ can be put into the form f(x) = 0 by simply moving all terms to 
the left of the equality sign 

$$1 - e^{-x} - 0.5 = 0.$$

For this particular case, it is easy to find the (analytical) solution 

$$x = -\log(0.5).$$

FIGURE 11.1 Synoptic view of methods presented in the chapter. 


However, in practice, an analytical solution does not always exist or may be expensive to get. 
In these cases, we resort to numerical solutions. In the following, we discuss with a number of 
examples several numerical methods for finding the zeros of a function. 


Graphical solution 


Let us consider the expression $\left(1 + \frac{1}{n-1}\right)^x = x^n$ defined for n > 1. We want to find the zeros in 
the interval $x \in [x_L, x_U]$ by inspection of the plot of the function. This is done with the following 
MATLAB code: 


f = @(x,n) ( (1 + 1/(n-1)).^x - x.^n ); 
xL = -2; xU = 5; 
x = linspace(xL,xU); 
plot(x,f(x,2)), grid on 


For the definition of the function we use MATLAB’s anonymous function syntax. Note that f is 
coded using element-wise operations so that a vector can be used as an argument. From the plot, 
we can read that in the interval [-2, 5] the function takes the value of zero for x ≈ -0.75, x ≈ 2, 
and x ≈ 4. 


Random search 


We randomly generate values for x and compute the corresponding value of f(x). Given X, the set 
of generated values, the solution $x^{\mathrm{sol}}$ is defined as 

$$x^{\mathrm{sol}} = \operatorname*{argmin}_{x \in X} |f(x)|.$$

For the generation of the x we use the MATLAB function rand, which generates uniform random 
variables in the interval [0, 1]. For given values of $x_L$, $x_U$, and a realization of the uniform random 
variable u, we generate x as 

$$x = x_L + (x_U - x_L)\,u.$$

Evaluating the function f(x) of the previous example for $R = 10^6$ randomly generated values of x 
in the interval [-2, 5] we obtain the following results for four executions. 


f = @(x,n) ( (1+(1/(n-1))).^x - x.^n ); 
xL = -2; xU = 5; n = 2; R = 1e6; 
for k = 1:4 
    x = xL + (xU - xL)*rand(R,1); z = f(x,n); 
    [sol,i] = min(abs(z)); 
    fprintf(' f(%9.6f) = %8.6f\n',x(i),sol); 
end 
 f( 4.000000) = 0.000001 
 f( 4.000001) = 0.000003 
 f(-0.766663) = 0.000003 
 f( 1.999999) = 0.000002 


11.1.2 Bracketing 


With this technique, we try to construct intervals likely to contain zeros. Later, the search can be 
refined for a given interval. We divide a given domain $[x_L, x_U]$ into regular intervals and examine 
whether the function crosses the abscissa by checking the sign of the function evaluated at the 
borders of the interval. Algorithm 29 details the procedure. 


Algorithm 29 Bracketing. 

 1: initialize $x_L$, $x_U$ and n 
 2: Δ = ($x_U$ - $x_L$)/n 
 3: a = $x_L$ 
 4: for i = 1:n do 
 5:     b = a + Δ 
 6:     if sign f(a) ≠ sign f(b) then 
 7:         # may contain a zero, save [a, b] 
 8:     end if 
 9:     a = b 
10: end for 

The MATLAB implementation is given hereafter. 
Listing 11.1: C-BasicMethods/M/./Ch11/Bracketing.m 

function I = Bracketing(f,xL,xU,n) 
% Bracketing.m  version 2010-03-27 
delta = (xU - xL)/n;          % width of each subinterval 
a = xL; k = 0; I = []; 
fa = feval(f,a); 
for i = 1:n 
    b = a + delta; 
    fb = feval(f,b); 
    if sign(fa)~=sign(fb)     % sign change: [a, b] may contain a zero 
        k = k + 1; 
        I(k,:) = [a b]; 
    end 
    a = b; fa = fb; 
end 


Example for bracketing zeros of the function $g(x) = \cos(1/x^2)$ in the domain $x \in [0.3, 0.9]$ with 
n = 25 intervals. 

f = @(x) cos(1./x.^2); 
I = Bracketing(f,0.3,0.9,25) 
I = 
    0.3000    0.3240 
    0.3480    0.3720 
    0.4440    0.4680 
    0.7800    0.8040 
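
The same sign test can also be written without a loop. The following is only a sketch under the assumption that f accepts vector arguments; the variable names are illustrative and not part of the original listing.

```matlab
% Vectorized bracketing sketch: flag grid cells in which f changes sign.
f   = @(x) cos(1./x.^2);
x   = linspace(0.3, 0.9, 26);          % 25 intervals -> 26 grid points
s   = sign(f(x));
idx = find(s(1:end-1) ~= s(2:end));    % left endpoints of sign-change cells
I   = [x(idx)' x(idx+1)'];             % one bracketing interval per row
disp(I)
```

Each row of I should correspond to one of the intervals returned by Bracketing for the same grid.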


11.1.3 Bisection 


The method of bisection seeks the zero of a function in a given interval [a, b] by halving the interval 
and identifying the semi-interval containing the zero. The procedure is reiterated until a sufficiently 
small interval is reached (see Algorithm 30). 


Algorithm 30 Bisection. 

 1: define error tolerance η 
 2: if sign f(a) = sign f(b) then stop 
 3: while |a - b| > η do 
 4:     c = a + (b - a)/2 
 5:     if sign f(a) ≠ sign f(c) then 
 6:         b = c    # (z left of c) 
 7:     else 
 8:         a = c    # (z right of c) 
 9:     end if 
10: end while 


In Statement 4 the center of the interval is computed as a + (b — a)/2. This is numerically 
more stable than computing (a + b)/2 for large values of a and b. The MATLAB implementation 
follows. 

Listing 11.2: C-BasicMethods/M/./Ch11/Bisection.m 

function c = Bisection(f,a,b,tol) 
% Bisection.m  -- version 2005-05-25 
% Zero finding with bisection method 
if nargin == 3, tol = 1e-8; end 
fa = feval(f,a); fb = feval(f,b); 
if sign(fa) == sign(fb) 
    error('sign f(a) is not opposite of sign f(b)'); 
end 
done = 0; 
while abs(b-a) > 2*tol && ~done 
    c = a + (b - a) / 2;          % midpoint of the current interval 
    fc = feval(f,c); 
    if sign(fa) ~= sign(fc)       % zero lies left of c 
        b = c; 
        fb = fc; 
    elseif sign(fc) ~= sign(fb)   % zero lies right of c 
        a = c; 
        fa = fc; 
    else                          % center and zero coincide 
        done = 1; 
    end 
end 


Hereafter, we use the bisection method to compute the zeros for the intervals identified previ- 
ously with the bracketing method. 


f = @(x) cos(1./x.^2); 
I = Bracketing(f,0.3,0.9,25); 
for i = 1:size(I,1) 
    z(i) = Bisection(f,I(i,1),I(i,2)); 
end 
z 
z = 
    0.3016    0.3568    0.4607    0.7979 
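
Because every step halves the interval, the number of bisection steps can be predicted in advance. The sketch below assumes the stopping rule of Listing 11.2 (interval width at most 2*tol) and counts the steps for the first bracketing interval; the counter it and the inlined loop are illustrative only.

```matlab
% Predicted versus observed number of bisection steps for one interval.
f = @(x) cos(1./x.^2);
a = 0.3; b = 0.324; tol = 1e-8;
n = ceil(log2((b - a) / (2*tol)));   % smallest n with (b-a)/2^n <= 2*tol
fa = f(a); it = 0;
while abs(b - a) > 2*tol
    c = a + (b - a)/2; fc = f(c);
    if sign(fa) ~= sign(fc)
        b = c;                       % zero lies left of c
    else
        a = c; fa = fc;              % zero lies right of c
    end
    it = it + 1;
end
fprintf('predicted %d steps, observed %d steps, zero near %.4f\n', n, it, c);
```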


11.1.4 Fixed point method 


Given f(x) = 0, we reorganize the expression of the function in the following way 


x =g(x), (11.1) 


where g is called the iteration function. Inserting a starting value $x^{(0)}$ in g we can compute a new 
value $x^{(1)} = g(x^{(0)})$. Repeating the procedure generates a sequence $\{x^{(k)}\}_{k=0,1,\ldots}$. This defines the 
fixed point iteration which is formalized in Algorithm 31. 


Algorithm 31 Fixed point iteration. 


1: initialize starting value $x^{(0)}$ 
2: for k = 1, 2, ... until convergence do 
3:     $x^{(k)} = g(x^{(k-1)})$ 
4: end for 


Before discussing the convergence in more detail, a few remarks are given. The complete 
sequence of values of x need not be stored, as only two successive elements are used in the 
algorithm. 


initialize starting value x0 
while not converged do 
    x1 = g(x0) 
    x0 = x1 
end while 


The solution satisfies the iteration function $x^{\mathrm{sol}} = g(x^{\mathrm{sol}})$ and is, therefore, also called a fixed 
point as the sequence remains constant from that point on. The choice of the iteration function g(x) 
determines the convergence of the method. An implementation with MATLAB is given hereafter, 
and we discuss the method by solving two problems. 


Listing 11.3: C-BasicMethods/M/./Ch11/FPI.m 

function x1 = FPI(f,x1,tol) 
% FPI.m  -- version 2008-05-08 
if nargin == 2, tol = 1e-6; end 
it = 0; itmax = 200; x0 = realmax; 
while ~converged(x0,x1,tol) 
    x0 = x1; 
    x1 = feval(f,x0); 
    it = it + 1; 
    if it > itmax, error('Maxit in FPI'); end 
end 
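
The helper converged is called in the listing but is not reproduced in this section. The sketch below assumes a relative-change criterion, which is only one plausible choice and may differ from the book's own helper, and inlines the fixed point loop for the illustrative iteration function g(x) = (x + 1/5)^(5/7).

```matlab
% Assumed convergence test (relative change); the original helper may differ.
converged = @(y0,y1,tol) all(abs(y1 - y0) ./ (abs(y0) + 1) < tol);

g  = @(x) (x + 1/5).^(5/7);          % illustrative iteration function
x0 = realmax; x1 = 0.2; tol = 1e-6; it = 0;
while ~converged(x0, x1, tol)
    x0 = x1;                         % keep only two successive iterates
    x1 = g(x0);
    it = it + 1;
    if it > 200, error('Maxit reached'); end
end
fprintf('x = %.4f after %d iterations\n', x1, it);
```

Initializing x0 with realmax guarantees that the first test fails, so the loop body is executed at least once.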


We consider the equation $x - x^{7/5} + 1/5 = 0$ and the corresponding two iteration functions 

$$g_1(x) = x^{7/5} - 1/5, \qquad g_2(x) = (x + 1/5)^{5/7}.$$

We first use iteration function $g_1$ with starting value x0 = 1.2 and observe that the iteration does 
not converge, and then we use $g_2$ with starting value x0 = 0.2 for which we get a solution. 


g1 = @(x) x.^(7/5)-1/5; 
z = FPI(g1,1.2) 
Error using FPI (line 9) 
Maxit in FPI 

z = FPI(g1,1.6) 
Error using FPI (line 9) 
Maxit in FPI 

g2 = @(x) (x+1/5).^(5/7); 
z = FPI(g2,0.2) 
z = 
    1.3972 
z = FPI(g2,2.4) 
z = 
    1.3972 


Fig. 11.2 plots the iteration functions, $g_1$ in the left panel and $g_2$ in the right panel. For $g_1$, the 
divergence is also observed for a starting value of 1.6, whereas $g_2$ converges also for a starting 
value of 2.4. From this graph we already see, what will be formalized below, that convergence is 
linked to the slope of the iteration function. For $g_1$, this slope is greater than 1, and for $g_2$, it is 
smaller than 1 (the slope of the bisectrix is 1). 

Now, we consider another equation $x - 2x^{3/5} + 2/3 = 0$ and the iteration functions 

$$g_3(x) = 2x^{3/5} - 2/3, \qquad g_4(x) = \left(\frac{x + 2/3}{2}\right)^{5/3}.$$

First, we explore the function values for the interval $x \in [0, 8]$ with our bracketing algorithm and, 
then, we apply the fixed point method. 


FIGURE 11.2 Left panel: iteration function $g_1$. Right panel: iteration function $g_2$. 

f = @(x) x-2*(x).^(3/5)+2/3; 
I = Bracketing(f,0,8,10) 
I = 
         0    0.8000 
    3.2000    4.0000 

g3 = @(x) 2*x.^(3/5)-2/3; 
z = FPI(g3,0.6) 
z = 
    3.7623 
z = FPI(g3,4.8) 
z = 
    3.7623 

g4 = @(x) ((x+2/3)/2).^(5/3); 
z = FPI(g4,3) 
z = 
    0.2952 
z = FPI(g4,0) 
z = 
    0.2952 


The two solutions can be seen in Fig. 11.3, in which, for the iteration function g3 and a starting 
value of 0.6, we diverge from the solution to the left (0.2952) because the slope is greater than 1 
and evolve toward the second solution (3.7623). For g4 when starting from 3, we move away from 
the nearest solution and evolve toward 0.2952. We see how this behavior depends on the local value 
of the slope of the iteration functions. 


Convergence 


A necessary condition for a fixed point method to converge for x € [a, b] is that the interval contains 
a zero and that the derivative of the iteration function satisfies 


$$|g'(x)| < 1 \quad \text{for } x \in [a, b].$$


As already pointed out before, Fig. 11.2 gives a graphical illustration of this condition: in the left 
panel, we observe that $g_1$ is steeper than the bisectrix and the iterations diverge, whereas in the 
right panel, the slope of $g_2$ is less than 1 and the iterations converge. 

The convergence condition is necessary for local convergence. Nevertheless, starting from a 
point not satisfying the convergence condition the iterations can move into a region where conver- 
gence takes place. This is illustrated in Fig. 11.3 where, starting from x = 0.6 in the left panel and 
x = 3 in the right panel, values for which g'(x) > 1, we converge. In both cases we first diverge 
from the nearest solution and then move to another solution located in a region of convergence. 
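
The slope condition can be checked numerically at the fixed point 1.3972 of the first example. The derivatives below are worked out by hand from $g_1(x) = x^{7/5} - 1/5$ and $g_2(x) = (x + 1/5)^{5/7}$; the names dg1, dg2, and xs are illustrative assumptions.

```matlab
% Slopes of the two iteration functions at the fixed point.
dg1 = @(x) (7/5) * x.^(2/5);            % derivative of g1
dg2 = @(x) (5/7) * (x + 1/5).^(-2/7);   % derivative of g2
xs  = 1.3972;
fprintf('|g1''(xs)| = %.3f\n', abs(dg1(xs)));
fprintf('|g2''(xs)| = %.3f\n', abs(dg2(xs)));
```

The first slope is roughly 1.6 and the second roughly 0.62, which agrees with the divergence observed for $g_1$ and the convergence observed for $g_2$.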


FIGURE 11.3 Left panel: iteration function $g_3$. Right panel: iteration function $g_4$. 


FIGURE 11.4 Left panel: iteration function satisfies g'(x) < —1. Right panel: iteration function satisfies —1 < g'(x) <0. 


Finally, in Fig. 11.4, we see that for g’(x) < 0, the iterations oscillate, they converge for —1 < 
g'(x) < 0 (right panel), and diverge for g'(x) < —1 (left panel). 

We see that according to the value of the derivative, an iteration function may, for a specific 
interval, converge or diverge. 


Example 11.1 
The S-estimator is among the solutions proposed in high breakdown point regression estimation.¹ Its 
estimation involves finding the zero of a nonlinear function.² The fixed point method is quite appropriate 
to solve this problem, and we illustrate an implementation with MATLAB. 
Consider the linear regression model 

$$y_i = \begin{bmatrix} x_{i1} & \cdots & x_{ip} \end{bmatrix} \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_p \end{bmatrix} + u_i, \qquad i = 1, \ldots, n,$$

where $\theta \in \mathbb{R}^p$ are the parameters to be estimated and $u \sim N(0, \sigma^2 I)$ is white noise. The residuals are 
$r_i = y_i - x_{i,\cdot}\,\theta$ and the S-estimator is defined as 

$$\hat{\theta}_S = \operatorname*{argmin}_{\theta} S(\theta), \tag{11.2}$$

where $S(\theta)$ is the solution of 

$$\frac{1}{n-p} \sum_{i=1}^{n} \rho_k\!\left(\frac{r_i}{S}\right) = \beta. \tag{11.3}$$

$\rho_k$ is a weighting function, with k = 1.58 a given constant, and is defined as 

$$\rho_k(s) = \begin{cases} 3(s/k)^2 - 3(s/k)^4 + (s/k)^6 & \text{if } |s| \le k \\ 1 & \text{if } |s| > k \end{cases}$$

and $\beta$ is a constant computed as 

$$\beta = \int_{-\infty}^{\infty} \rho_k(s)\, d\Phi(s)$$

with $\Phi(s)$ being the standard normal distribution. The left panel in Fig. 11.5 shows the shape of the 
function $\rho_k$, and we see that the constant k determines the limits beyond which the weights are equal 
to 1. The vector $\rho_k$ is computed with the function Rho.m. Note that k is set in the code, and the way 
the powers of s/k are computed is about six times faster than coding exponents. This is important as the 
function will be called many times as the fixed point algorithm is nested in the optimization procedure 
that successively evaluates $S(\theta)$ in the search for $\hat{\theta}_S$. 


1. Examples with the LMS and LTS estimator are discussed in Section 16.2. 
2. See Marazzi (1992) and Ruppert (1992). 




Listing 11.4: C-BasicMethods/M/./Ch11/Rho.m 

function F = Rho(s) 
% Rho.m  -- version 2010-11-12 
k = 1.58; 
F = ones(length(s),1); 
I = (abs(s) <= k); 
temp1 = s(I)/k; 
temp2 = temp1 .* temp1;      % (s/k)^2 
temp4 = temp2 .* temp2;      % (s/k)^4 
temp6 = temp4 .* temp2;      % (s/k)^6 
F(I) = 3*temp2 - 3*temp4 + temp6; 


To compute $\beta$, we have to evaluate the integral of the product of $\rho_k$ times the standard normal 
density. As this density rapidly tends to zero, it is sufficient to integrate from -5 to 5. This can also be 
verified in Fig. 11.5 in which the right panel shows the graph of $\rho_k(s)\,\phi(s)$. We computed $\beta$ = 0.4919 
using the MATLAB function quad for the integration. 


b = quad('RhoN',-5,5) 
b = 
    0.4919 

Listing 11.5: C-BasicMethods/M/./Ch11/RhoN.m 

function F = RhoN(s) 
% RhoN.m  -- version 2010-11-12 
d = exp(-0.5 * s.^2) ./ sqrt(2*pi); 
F = Rho(s) .* d'; 

FIGURE 11.5 Shape of function $\rho_k$ (left panel) and shape of function $\rho_k(s)\,\phi(s)$ (right panel). 
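
The value of the constant can be cross-checked without the two helper files. This is a sketch that assumes a MATLAB version providing the function integral; the anonymous functions rho and phi are illustrative stand-ins for Rho.m and the normal density, not the book's code.

```matlab
% Cross-check of beta with anonymous functions and integral().
k   = 1.58;
t   = @(s) s / k;
rho = @(s) (abs(s) <= k) .* (3*t(s).^2 - 3*t(s).^4 + t(s).^6) + (abs(s) > k);
phi = @(s) exp(-0.5 * s.^2) / sqrt(2*pi);      % standard normal density
b   = integral(@(s) rho(s) .* phi(s), -5, 5);
fprintf('beta = %.4f\n', b);
```

Truncating the range at -5 and 5 discards only the normal tail mass beyond five standard deviations, which is negligible at four decimals.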

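The iteration function is then derived by dividing Eq. (11.3) by $\beta$, 

$$\frac{1}{(n-p)\beta} \sum_{i=1}^{n} \rho_k\!\left(\frac{r_i}{S}\right) = 1,$$

multiplying the result by $S^2$, 

$$S^2 = \frac{S^2}{(n-p)\beta} \sum_{i=1}^{n} \rho_k\!\left(\frac{r_i}{S}\right),$$

and taking the square root, 

$$S = \sqrt{\frac{S^2}{(n-p)\beta} \sum_{i=1}^{n} \rho_k\!\left(\frac{r_i}{S}\right)}.$$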


The solution method is now tested on a small data set taken from Brown and Hollander (1977). First, 
we only seek S that solves Eq. (11.3) for one given residual vector r. 
load BrHo.dat               % read Brown and Hollander (1977) data 
[n,p] = size(BrHo); 
y = BrHo(:,2); X = [ones(n,1) BrHo(:,1)]; 
theta0 = X \ y;             % compute OLS estimate 
r = y - X*theta0; 
b = 0.4919; c = (n - p) * b; 
S0 = median(abs(r)) / b;    % starting value 
S1 = FPIS(S0,r,c) 
S1 = 
   13.6782 

Listing 11.6: C-BasicMethods/M/./Ch11/FPIS.m 

function S1 = FPIS(S1,r,c) 
if nargin == 3, tol = 1e-4; end 
it = 0; itmax = 100; S0 = -S1; 
while ~converged(S0,S1,tol) 
    S0 = S1; 
    S1 = sqrt( S0^2/c * sum(Rho(r/S0)) ); 
    it = it + 1; 
    if it > itmax, error('Maxit in FPIS'); end 
end 


To complete the exercise, we can think of solving the optimization problem defined in Eq. (11.2). For 
simplicity, we chose the Nelder–Mead direct search method³ explained in Section 11.4.4. The MATLAB 
code below calls the fminsearch function which performs the Nelder–Mead search. We note the 
improvement of the value of S with respect to the initial value obtained before. 


load BrHo.dat               % read Brown and Hollander (1977) data 
[n,p] = size(BrHo); y = BrHo(:,2); X = [ones(n,1) BrHo(:,1)]; 
options = optimset('fminsearch'); 
theta0 = X \ y; b = 0.4919; p = size(X,2); c = (n - p) * b; 
[theta, S, flag] = fminsearch('OF',theta0,options,y,X,b,c) 
theta = 
   81.8138 
   28.5164 

Listing 11.7: C-BasicMethods/M/./Ch11/OF.m 

function F = OF(theta,y,X,b,c) 
% OF.m  -- version 2010-11-11 
r = y - X*theta; 
S0 = median(abs(r)) / b; 
F = FPIS(S0,r,c); 

3. There exist specific methods to solve this problem efficiently, see Marazzi (1992) and Salibian-Barrera and Yohai (2006). 


11.1.5 Newton’s method 
Newton’s method is derived from the Taylor expansion of a function f in the neighborhood of x, 

$$f(x + \Delta_x) = f(x) + \Delta_x f'(x) + R(x).$$

Ignoring the remainder R(x), we seek the step $\Delta_x$ for which we have $f(x + \Delta_x) = 0$, 

$$f(x_{k+1}) \approx f(x_k) + (x_{k+1} - x_k)\, f'(x_k) = 0,$$

where $x_{k+1} = x_k + \Delta_x$, from which we can compute the step $s = -f(x_k)/f'(x_k)$ from $x_k$ to $x_{k+1}$, 
improving the approximation of the zero of f (Algorithm 32): 


Algorithm 32 Newton’s method for zero finding. 


1: initialize starting value $x_0$ 
2: for k = 0, 1, 2, ... until convergence do 
3:     compute $f(x_k)$ and $f'(x_k)$ 
4:     $s = -f(x_k)/f'(x_k)$ 
5:     $x_{k+1} = x_k + s$ 
6: end for 


In the following, we give an implementation with MATLAB in which the derivative is approxi- 
mated numerically. 


Listing 11.8: C-BasicMethods/M/./Ch11/Newton0.m 

function x1 = Newton0(f,x1,tol) 
% Newton0.m  -- version 2010-03-30 
% Newton method with numerical derivation 
if nargin == 2, tol = 1e-8; end 
it = 0; itmax = 10; x0 = realmax; h = 1e-8; 
while ~converged(x0,x1,tol) 
    x0 = x1; 
    f0 = feval(f,x0); 
    f1 = feval(f,x0 + h); 
    df = (f1 - f0) / h;           % forward-difference derivative 
    x1 = x0 - f0/df; 
    it = it + 1; 
    if it > itmax, error('Maxit in Newton0'); end 
end 


The working of Newton’s method is illustrated in Fig. 11.6 in which we seek zeros for the 
function $f(x) = e^{-x}\log(x) - x^2 + x^3/3 + 1$ for different starting values x0 = 2.750, 0.805, 0.863, 
and 1.915. Note that using a numerical approximation instead of the analytical derivative will have 
little influence on our results. 
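
The remark about the numerical derivative can be checked at one point. In this sketch the analytical derivative df is worked out by hand from the function above and the step size h = 1e-8 mirrors the listing; the variable names are illustrative.

```matlab
% Forward-difference versus analytical derivative at x0 = 2.75.
f    = @(x) exp(-x).*log(x) - x.^2 + x.^3/3 + 1;
df   = @(x) exp(-x).*(1./x - log(x)) - 2*x + x.^2;   % analytical derivative
x0   = 2.750; h = 1e-8;
dnum = (f(x0 + h) - f(x0)) / h;                      % forward difference
step_a = -f(x0) / df(x0);                            % Newton step, analytical
step_n = -f(x0) / dnum;                              % Newton step, numerical
fprintf('derivatives: %.6f (analytical), %.6f (numerical)\n', df(x0), dnum);
fprintf('Newton steps differ by %.2e\n', abs(step_a - step_n));
```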


f = @(x) exp(-x).*log(x)-x.^2+x.^3/3+1; 
z1 = Newton0(f,2.750) 
z1 = 
    2.4712 
z2 = Newton0(f,0.805) 
z2 = 
    2.4712 
z3 = Newton0(f,0.863) 
z3 = 
    1.4512 
z4 = Newton0(f,1.915) 
Error using Newton0 (line 13) 
Maxit in Newton0 

FIGURE 11.6 Behavior of the Newton method for different starting values. Upper left: x0 = 2.750 and $x^{\mathrm{sol}}$ = 2.4712. 
Upper right: x0 = 0.805 and $x^{\mathrm{sol}}$ = 2.4712. Lower left: x0 = 0.863 and $x^{\mathrm{sol}}$ = 1.4512. Lower right: x0 = 1.915 and 
algorithm diverges. 


Comments 


The choice of a particular method may depend on the application. For instance, the method of 
bisection needs an interval and may, therefore, be considered more robust. For the fixed point and 
Newton’s method the convergence depends on the starting value, and a value close to a zero does 
not necessarily drive the algorithm to this solution (e.g., see Fig. 11.3 where in the left panel, 
x0 = 0.6 goes to $x^{\mathrm{sol}}$ = 3.7623 and not to the closer solution 0.2952). 

Note that for finding the roots of a polynomial, there exist specific algorithms which resort to 
the computation of the eigenvalues of the so-called companion matrix. 
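
As a small illustration of the companion matrix idea, the sketch below uses the MATLAB functions compan, eig, and roots on an illustrative cubic whose roots are known; the polynomial is an assumption chosen for the example.

```matlab
% Roots of a polynomial as eigenvalues of its companion matrix.
p = [1 -6 11 -6];          % x^3 - 6x^2 + 11x - 6 = (x-1)(x-2)(x-3)
C = compan(p);             % companion matrix of p
z = sort(real(eig(C)));    % eigenvalues of C are the roots of p
disp(z')
disp(sort(real(roots(p)))')
```

Both displayed vectors should agree up to rounding.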

In MATLAB, the function dedicated to finding the zeros of a function is fzero. Below we 
compute the zeros of the function for the illustration of Newton’s method and use the same starting 
values we did before. We observe that the algorithm produces the same result only for the first 
starting value and converges to different solutions for the remaining ones. The default options for 
fzero are initialized with the function optimset. 


f = @(x) exp(-x).*log(x)-x.^2+x.^3/3+1; 
options = optimset('fzero'); 
z1 = fzero(f,2.750); 
z2 = fzero(f,0.805); 
z3 = fzero(f,0.863); 
z4 = fzero(f,1.915); 
disp([z1 z2 z3 z4]) 
    2.4712    0.2907    0.2907    1.4512 
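
When a bracketing interval is known, fzero also accepts a two-element vector instead of a single starting value, which ties the result to that interval. A minimal sketch, with the interval [2, 3] chosen here for illustration:

```matlab
% fzero with a bracketing interval instead of a starting value.
f = @(x) exp(-x).*log(x) - x.^2 + x.^3/3 + 1;
fprintf('f(2) = %.4f, f(3) = %.4f\n', f(2), f(3));   % opposite signs
z = fzero(f, [2 3]);
fprintf('zero in [2, 3]: %.4f\n', z);
```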


These default options can be overwritten with the function optimset. Below we modify 
the tolerance for the convergence and display the complete list of output arguments computed by 
fzero. 


options = optimset(options,'TolFun',1e-3); 
[x,fval,exitflag,output] = fzero(f,2.750,options) 
x = 
    2.4712 
fval = 
  -8.8818e-16 
exitflag = 
     1 
output = 
  struct with fields: 
    intervaliterations: 5 
            iterations: 6 
             funcCount: 16 
             algorithm: 'bisection, interpolation' 
               message: 'Zero found in the interval [2.43887, 2.97]' 


11.2 Classical unconstrained optimization 


We consider the (mathematical) problem of finding for a function $f : \mathbb{R}^n \to \mathbb{R}$ of several variables 
the argument that corresponds to a minimal function value 

$$x^* = \operatorname*{argmin}_{x \in \mathbb{R}^n} f(x).$$

The function f is called the objective function, and $x^*$ is the solution (minimizer). The function 
f is assumed to be twice-continuously differentiable. The solution $x^* \in X$ is a global minimizer 
of f if $f(x^*) \le f(x)\ \forall x \in X$, where X is the feasible region or constraint set. If 
$\exists\, \delta > 0 \mid f(x^*) \le f(x)\ \forall x \in X \cap B(x^*, \delta)$, $x^*$ is called a local minimizer. In this latter case, the function is minimal 
within a region B as defined. Local and global minimizers as well as a saddle point are illustrated 
in the following figure. 


Typically, an optimization problem is unconstrained if $X = \mathbb{R}^n$ and constrained if X is described 
by a set of equality (E) and inequality (I) constraints: 

$$X = \{x \in \mathbb{R}^n \mid c_i(x) = 0 \text{ for } i \in E \text{ and } c_i(x) \ge 0 \text{ for } i \in I\}.$$


A maximization problem can easily be reformulated into a minimization problem by changing 
the sign of the objective function. 

Optimization problems come in many different flavors, and the following criteria could be used 
for classification: number of variables, number of constraints, properties of the objective function 
(linear, quadratic, nonlinear, convex, ...), and properties of the feasibility region or constraints 
(convex, only linear or inequality constraints, linear or nonlinear constraints, .. . ). 

For optimization problems with a particular structure, specific algorithms have been developed 
taking advantage of this structure. Below is a short list of optimization problems with special struc- 
ture. 


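◦ Linear programming: f(x) and $c_i(x)$ are linear. 
◦ Quadratic programming: f(x) is quadratic and $c_i(x)$ are linear. 
◦ Convex programming: f(x) is convex and the set X is convex. 
◦ Nonlinear Least Squares: $f(x) = \frac{1}{2} \sum_{j=1}^{m} f_j^2(x)$ and $X = \mathbb{R}^n$. 
◦ Bound-constrained optimization: $\ell_i \le x_i \le u_i$. 
◦ Network optimization: objective function and constraints have a special structure arising from a 
  graph. 

In the following, the gradient $\nabla f(x)$ of f(x), $x \in \mathbb{R}^n$, and the Hessian matrix 
$\{\nabla^2 f(x)\}_{ij} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}$ are denoted, respectively: 

$$\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} \quad \text{and} \quad \nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f(x)}{\partial x_1^2} & \frac{\partial^2 f(x)}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f(x)}{\partial x_2 \partial x_1} & \frac{\partial^2 f(x)}{\partial x_2^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f(x)}{\partial x_n \partial x_1} & \frac{\partial^2 f(x)}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_n^2} \end{bmatrix}.$$
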
Convergence 


The algorithms in the following presentation are iterative procedures, and the speed with which 
they approach the solution is expressed as the convergence rate defined by the magnitude of the 
exponent r in the expression, 


$$\lim_{k \to \infty} \frac{\|e^{(k+1)}\|}{\|e^{(k)}\|^{r}} = c,$$

where $e^{(k)}$ is the error at iteration k and c is a finite constant. The following situations are 
distinguished: 

• r = 1 and c < 1, linear convergence 
• r > 1, superlinear convergence 
• r = 2, quadratic convergence 


In practice, this corresponds to a gain in precision per iteration that is a constant number of 
digits for linear convergence; the increment of precise digits increases from iteration to iteration for 
superlinear convergence and, in the case of quadratic convergence, the number of precise digits 
roughly doubles with every iteration. 
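
The exponent r can be estimated from three successive errors as log(e(k+1)/e(k)) / log(e(k)/e(k-1)). The sketch below does this for Newton's zero-finding iteration on the illustrative problem x^2 - 2 = 0, which is an assumption chosen because its solution sqrt(2) is known exactly.

```matlab
% Empirical convergence order of Newton's method on x^2 - 2 = 0.
f  = @(x) x.^2 - 2;  df = @(x) 2*x;
x  = 1;  e = abs(x - sqrt(2));       % error of the starting value
for k = 1:4
    x = x - f(x)/df(x);
    e(end+1) = abs(x - sqrt(2));     % error after k steps
end
r = log(e(3:end)./e(2:end-1)) ./ log(e(2:end-1)./e(1:end-2));
disp(r)
```

The estimates approach 2, in line with quadratic convergence.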


Conditions for local minimizer 


The Taylor expansion at a local minimizer $x^*$ is 

$$f(x^* + z) = f(x^*) + \underbrace{\nabla f(x^*)'\, z}_{C1} + \frac{1}{2} \underbrace{z'\, \nabla^2 f(x^* + \xi z)\, z}_{C2}$$

with $\xi \in [0, 1]$. The sufficient conditions for a local minimizer are then given by the first-order 
conditions (C1) 

$$\nabla f(x^*) = 0,$$

that is, the gradient has to be zero, and the second-order conditions (C2) 

$$z'\, \nabla^2 f(x^*)\, z > 0 \quad \forall z \in \mathbb{R}^n,\ z \ne 0,$$

that is, the Hessian matrix must be positive-definite. If the latter is not the case, $x^*$ is a saddle point. 
The problem is closely related to solving nonlinear equations F(x) = 0, where F(x) corresponds 
to $\nabla f(x)$. 


Classification of methods 


We distinguish two categories of methods for solving the unconstrained optimization problems: 
◦ Gradient-based methods 
    • Steepest descent 
    • Newton’s method 
◦ Direct search methods 
    • Golden section method (one dimension) 
    • Simplex method (Nelder–Mead) 


Direct search can be defined as follows: 


• The method uses only objective function evaluations to determine a search direction. 
• The method does not model the objective function or its derivatives to derive a search direction 
  or a step size; in particular, it does not use finite differences. 


This definition loosely follows Wright (1996). 


11.3 Unconstrained optimization in one dimension 


Steepest descent has no meaning in one dimension. 


11.3.1 Newton’s method 


For a given value of x, Newton’s method approximates the function by a local model which is 
a quadratic function. The local model is derived from the truncated second-order Taylor series 
expansion of the objective function 


$$f(x + h) \approx \underbrace{f(x) + f'(x)\, h + \tfrac{1}{2} f''(x)\, h^2}_{\text{local model}}.$$

We then establish the first-order conditions ∂f(x + h)/∂h = 0 for the minimum of the local model 

$$f'(x) + h\, f''(x) = 0,$$

from which we derive that the local model has its minimum for $h = -f'(x)/f''(x)$, suggesting the 
iteration scheme⁴ 

$$x^{(k+1)} = x^{(k)} - \frac{f'(x^{(k)})}{f''(x^{(k)})}.$$

Algorithm 33 summarizes Newton’s method for unconstrained optimization of a function in one 
dimension. 


Algorithm 33 Newton’s method for unconstrained optimization. 


1: initialize $x^{(0)}$ (close to the minimum) 
2: for k = 0, 1, 2, ... until convergence do 
3:     compute $f'(x^{(k)})$ and $f''(x^{(k)})$ 
4:     $x^{(k+1)} = x^{(k)} - f'(x^{(k)})/f''(x^{(k)})$ 
5: end for 


As an example, we consider the function 

$$f(x) = 1 - \log(x)\, e^{-x^2},$$

where the first- and second-order derivatives, respectively, are 

$$f'(x) = -\frac{e^{-x^2}}{x} + 2 \log(x)\, x\, e^{-x^2},$$

$$f''(x) = \frac{e^{-x^2}}{x^2} + 4 e^{-x^2} + 2 \log(x)\, e^{-x^2} - 4 \log(x)\, x^2 e^{-x^2}.$$

4. We recognize that this corresponds to Newton’s method for finding the zero of the function f'(x) = 0. 
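
Algorithm 33 applied to this function can be sketched in a few lines. The derivatives are coded directly from the expressions above; the starting value 1.2, the stopping rule on the step length, and the iteration cap are illustrative assumptions.

```matlab
% Newton's method in one dimension for f(x) = 1 - log(x)*exp(-x^2).
df  = @(x) exp(-x.^2) .* (-1./x + 2*x.*log(x));
d2f = @(x) exp(-x.^2) .* (1./x.^2 + 4 + 2*log(x) - 4*x.^2.*log(x));
x = 1.2;                              % start in a convex region (f'' > 0)
for k = 1:20
    h = -df(x) / d2f(x);              % step to the minimum of the local model
    x = x + h;
    if abs(h) < 1e-10, break, end
end
fprintf('x = %.4f after %d steps\n', x, k);
```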


In order to draw the local model (only for illustration purposes) 

$$f(x^{(k)} + h) \approx f(x^{(k)}) + f'(x^{(k)})\, h + \tfrac{1}{2} f''(x^{(k)})\, h^2,$$

we have to replace the expressions for the first- and second-order derivatives and the value for 
$x^{(k)}$, the point where the local model is constructed. This leads to a rather lengthy expression not 
reproduced here. In practice, the length h of the step to the minimum is easy to compute. In the 
following figures, we draw the local model (thin dark line) for two successive steps $x^{(1)}$ and $x^{(2)}$. 

[Figure: local model at $x^{(1)}$ and at $x^{(2)}$] 

A starting point $x^{(0)}$ in a nonconvex region leads to a nonconvex local model and the computed 
step h goes to a maximum at $x^{(1)}$ instead of a minimum. 

[Figure: nonconvex local model at $x^{(0)}$ with step to $x^{(1)}$] 


11.3.2 Golden section search 


The method applies to a unimodal function f(x) in the interval $x \in [a, b]$, 

$$f(x) \text{ is } \begin{cases} \text{decreasing} & \text{if } x < x^* \\ \text{increasing} & \text{if } x > x^* \end{cases}$$

and searches the minimum by reducing the interval containing the minimum by a constant ratio⁵ 
$\tau = (\sqrt{5} - 1)/2$. Algorithm 34 details the procedure. The graphic illustrates the first three steps when 
minimizing the function given in the example for Newton’s method. 

5. The procedure is similar to the bisection algorithm in which this reduction is 1/2. 


Listing 11.9: C-BasicMethods/M/./Ch11/GSS.m 

function c = GSS(f,a,b,tol) 
% GSS.m  -- version 2006-05-26 
% Golden Section Search 
if nargin == 3, tol = 1e-8; end 
tau = (sqrt(5) - 1)/2; 
x1 = a + (1-tau)*(b-a); f1 = feval(f,x1); 
x2 = a + tau*(b-a);     f2 = feval(f,x2); 
while (b-a) > tol 
    if f1 < f2 
        b = x2; 
        x2 = x1; f2 = f1; 
        x1 = a + (1-tau)*(b-a); f1 = feval(f,x1); 
    else 
        a = x1; 
        x1 = x2; f1 = f2; 
        x2 = a + tau*(b-a); f2 = feval(f,x2); 
    end 
end 
c = a + (b-a)/2; 


Algorithm 34 Golden section search. 

 1: compute $x_1 = a + (1 - \tau)(b - a)$ and $f_1 = f(x_1)$ 
 2: compute $x_2 = a + \tau(b - a)$ and $f_2 = f(x_2)$ 
 3: while (b - a) > η do 
 4:     if $f_1 < f_2$ then 
 5:         b = $x_2$ 
 6:         $x_2 = x_1$ 
 7:         $f_2 = f_1$ 
 8:         $x_1 = a + (1 - \tau)(b - a)$ 
 9:         $f_1 = f(x_1)$ 
10:     else 
11:         a = $x_1$ 
12:         $x_1 = x_2$ 
13:         $f_1 = f_2$ 
14:         $x_2 = a + \tau(b - a)$ 
15:         $f_2 = f(x_2)$ 
16:     end if 
17: end while 

a = 1; b = 2; 
f = @(x) 1-log(x).*exp(-x.^2); 
z = GSS(f,a,b) 
z = 
    1.3279 
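
Two properties of the ratio explain the structure of the algorithm: since tau^2 = 1 - tau, one of the two interior points can be reused after each reduction, and since the interval shrinks by tau per step, the number of steps is known in advance. A sketch for the interval and tolerance used above (the variable n is illustrative):

```matlab
% Golden ratio property and predicted number of reduction steps.
tau = (sqrt(5) - 1)/2;
fprintf('tau^2 = %.6f, 1 - tau = %.6f\n', tau^2, 1 - tau);
a = 1; b = 2; tol = 1e-8;
n = ceil(log(tol/(b - a)) / log(tau));   % smallest n with (b-a)*tau^n <= tol
fprintf('steps needed: %d (one new function evaluation per step)\n', n);
```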


For unconstrained optimization, MATLAB provides the two functions fminunc and fminbnd. 
Examples for their use are given in Section 11.4.5. 


11.4 Unconstrained optimization in multiple dimensions 


11.4.1 Steepest descent method 


The negative gradient —V f (x) is locally the direction of steepest descent, that is, the function f 
decreases more rapidly along the direction of the negative gradient than along any other direction. 
Starting from an initial point $x^{(0)}$, the successive approximations of the solution are given by 

$$x^{(k+1)} = x^{(k)} - \alpha\, \nabla f(x^{(k)}),$$

where $\alpha$ is the solution of the one-dimensional minimization problem 

$$\min_{\alpha} f\!\left(x^{(k)} - \alpha\, \nabla f(x^{(k)})\right).$$

The steepest descent method is formalized in Algorithm 35. 

Below we find a simple implementation of the steepest descent method with MATLAB. The 
gradient is computed numerically with the function numG, and the one-dimensional minimization 
is performed with lineSearch. The optimal $\alpha$ is simply determined by moving forward with 
constant steps along the direction of the negative gradient until f starts to increase. 
Algorithm 35 Steepest descent method. 

1: initialize $x^{(0)}$ 
2: for k = 0, 1, 2, ... until convergence do 
3:     compute $\nabla f(x^{(k)})$ 
4:     compute $\alpha^* = \operatorname{argmin}_{\alpha} f(x^{(k)} - \alpha\, \nabla f(x^{(k)}))$ 
5:     $x^{(k+1)} = x^{(k)} - \alpha^*\, \nabla f(x^{(k)})$ 
6: end for 

Listing 11.10: C-BasicMethods/M/./Ch11/SteepestD.m 

function x1 = SteepestD(f,x1,tol,ls) 
% SteepestD.m  -- version 2010-08-31 
if nargin == 2, tol = 1e-4; ls = 0.1; end 
x1 = x1(:); x0 = -x1; k = 1; 
while ~converged(x0,x1,tol) 
    x0 = x1; 
    g = numG(f,x1);                  % numerical gradient 
    as = lineSearch(f,x0,-g,ls);     % step length along -g 
    x1 = x0 - as * g; 
    k = k + 1; if k > 300, error('Maxit in SteepestD'); end 
end 


Listing 11.11: C-BasicMethods/M/./Ch11/numG.m 

function g = numG(f,x) 
% numG.m  -- version 2010-08-31 
n = numel(x); g = zeros(n,1); x = x(:); 
Delta = diag(max(sqrt(eps) * abs(x),sqrt(eps))); 
F0 = feval(f,x); 
for i = 1:n 
    F1 = feval(f,x + Delta(:,i)); 
    g(i) = (F1 - F0) / Delta(i,i);   % forward difference 
end 


Listing 11.12: C-BasicMethods/M/./Ch11/lineSearch.m 

function a1 = lineSearch(f,x,s,d) 
% lineSearch.m  -- version 2010-09-13 
if nargin == 3, d = 0.1; end 
done = 0; a1 = 0; k = 0; 
f1 = feval(f,x); 
while ~done 
    a0 = a1; a1 = a0 + d; 
    f0 = f1; f1 = feval(f,x+a1*s); 
    if f1 > f0, done = 1; end 
    k = k + 1; 
    if k > 100, fprintf('Early stop in lineSearch'); break, end 
end 


Example 11.2 


We illustrate the steepest descent method by searching the minimum of the function 

$$f(x_1, x_2) = \exp\!\left(0.1\,(x_2 - x_1^2)^2 + 0.05\,(1 - x_1)^2\right).$$

Using as starting point $x^{(0)}$ = [-0.3 0.8] and the default tolerance for the convergence test, we obtain 
the solution $x^{\mathrm{sol}}$ = [0.9983 0.9961]. If we want to get closer to the minimum [1 1], we have to use a 
lower value for the tolerance, for example, $10^{-6}$. 
f = @(x) exp(0.1*(x(2)-x(1).^2).^2 + 0.05*(1-x(1)).^2); 
x0 = [-0.3 0.8]; 
xs = SteepestD(f,x0) 
xs = 
    0.9983 
    0.9961 


Fig. 11.7 reproduces the first 30 steps of the algorithm. We observe the slow (linear) convergence of the algorithm when reaching the flat region near the minimum. The solution $x^{sol} = [0.9983 \;\; 0.9961]'$ is reached after 137 steps.


FIGURE 11.7 Minimization of $f(x_1, x_2) = \exp\big(0.1\,(x_2 - x_1^2)^2 + 0.05\,(1 - x_1)^2\big)$ with the steepest descent method. Right panel: minimization over $\alpha$ for the first step ($\alpha^* = 5.87$).


11.4.2 Newton’s method 


Newton’s method presented for the one-dimensional optimization in Section 11.3.1 can be generalized to the n-dimensional problem. We consider a local quadratic approximation for the function $f(x)$ at $x \in \mathbb{R}^n$ using the truncated Taylor series expansion
\[ f(x + h) \approx f(x) + \nabla f(x)' h + \tfrac{1}{2}\, h' \nabla^2 f(x)\, h. \]
The first-order condition for the minimum of the quadratic model is
\[ \nabla f(x) + \nabla^2 f(x)\, h = 0, \]
and the step $h$ that minimizes the local model is then the solution of the linear system
\[ \nabla^2 f(x)\, h = -\nabla f(x). \]
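The role of the local quadratic model can be illustrated with a function that is itself quadratic: the model is then exact, and a single Newton step lands on the minimum from any starting point. The following sketch assumes an illustrative matrix A, vector b, and starting point that do not appear in the text.

```matlab
% One Newton step on f(x) = 0.5*x'*A*x - b'*x (A, b, x are illustrative choices)
A = [4 1; 1 3];          % Hessian of f, symmetric positive definite
b = [1; 2];
x = [10; -7];            % arbitrary starting point
g = A*x - b;             % gradient of f at x
h = -A \ g;              % Newton step: solve (Hessian) * h = -(gradient)
xnew = x + h;
disp(norm(A*xnew - b))   % gradient norm at the new point (zero up to rounding)
```

For a non-quadratic f, the same step is only as good as the local model, which is why the iteration has to be repeated.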


Algorithm 36 Newton’s method for unconstrained optimization in n dimensions.

1: initialize $x^{(0)}$
2: for $k = 0, 1, 2, \ldots$ until convergence do
3:     compute $\nabla f(x^{(k)})$ and $\nabla^2 f(x^{(k)})$
4:     solve $\nabla^2 f(x^{(k)})\, s^{(k)} = -\nabla f(x^{(k)})$
5:     $x^{(k+1)} = x^{(k)} + s^{(k)}$
6: end for


The MATLAB script below is a very simple implementation of Algorithm 36 to solve the two-dimensional minimization of the function given in Example 11.2. The symbolic expressions for the first and second derivatives are computed by MATLAB’s Symbolic Toolbox. In Fig. 11.8, we observe the fast (quadratic) convergence of Newton’s method.

FIGURE 11.8 Minimization of $f(x_1, x_2) = \exp\big(0.1\,(x_2 - x_1^2)^2 + 0.05\,(1 - x_1)^2\big)$ with Newton’s method. Contour plots of the local model for the first three steps.


syms x1 x2
f = exp(0.1*(x2-x1^2)^2 + 0.05*(1-x1)^2);
dx1 = diff(f,x1); dx2 = diff(f,x2);
dx1x1 = diff(f,'x1',2); dx1x2 = diff(dx1,x2);
dx2x2 = diff(f,'x2',2);
y1 = [-0.30 0.80]';
tol = 1e-4; k = 1; y0 = -y1;
while ~converged(y0,y1,tol)
    y0 = y1;
    x1 = y0(1); x2 = y0(2);
    J = [eval(dx1) eval(dx2)];
    H = [eval(dx1x1) eval(dx1x2)
         eval(dx1x2) eval(dx2x2)];
    s = -H \ J';
    y1 = y0 + s;
    k = k + 1; if k > 10, error('Maxit in NewtonUOnD'); end
end
y1, k
y1 =
    1.0000
    1.0000
k =
    10


The quadratic convergence of Newton’s method occurs only if the starting point is chosen appropriately. Fig. 11.9 illustrates how the method can diverge if the local model is either nonconvex (left panel), or convex but leads to a large step into an inappropriate region (right panel).


11.4.3 Quasi-Newton method 


The exact computation of the gradient and the Hessian matrix in the Newton algorithm does not guarantee convergence to a solution. Two reasons favor the use of approximated gradients and Hessian matrices. First, they may be cheaper to compute; second, and more importantly, they may contribute to a more robust behavior of the algorithm. This comes, of course, at the cost of slower convergence.

FIGURE 11.9 Nonconvex local model for starting point (left panel) and convex local model but divergence of the step (right panel).


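Algorithm 37 Quasi-Newton for unconstrained optimization in n dimensions.

1: initialize $x^{(0)}$ and $B_0$
2: for $k = 0, 1, 2, \ldots$ until convergence do
3:     solve $B_k\, p^{(k)} = -\nabla f(x^{(k)})$
4:     $s^{(k)} = \alpha\, p^{(k)}$        # Line search along $p^{(k)}$
5:     $x^{(k+1)} = x^{(k)} + s^{(k)}$
6:     $y^{(k)} = \nabla f(x^{(k+1)}) - \nabla f(x^{(k)})$
7:     update $B_{k+1} = B_k + U$
8: end for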


Algorithm 37 sketches a classical quasi-Newton method with updating of the Hessian matrix. One can take $B_0 = I$, and the way matrix $U$ is computed defines a particular method. In the following, we give an illustration with the popular BFGS method, developed independently by Broyden, Fletcher, Goldfarb, and Shanno in the seventies.⁶ The gradient $\nabla f(x^{(k)})$ is evaluated numerically with the function numG already given on page 246. The updating matrix $U$ is defined as
\[ U = \frac{y^{(k)}\, y^{(k)\prime}}{y^{(k)\prime}\, s^{(k)}} - \frac{(B_k s^{(k)})\,(B_k s^{(k)})'}{s^{(k)\prime}\, B_k\, s^{(k)}}. \]


Listing 11.13: C-BasicMethods/M//Ch11/BFGS.m 


function x1 = BFGS(f,x1,tol,ls)
% BFGS.m -- version 2010-09-14
if nargin == 2, tol = 1e-4; ls = 0.1; end
g1 = numG(f,x1); B1 = eye(numel(x1));
x1 = x1(:); x0 = -x1; k = 1;
while ~converged(x0,x1,tol)
    x0 = x1; g0 = g1; B0 = B1;
    p0 = -B0\g0;
    as = lineSearch(f,x0,p0,ls);
    s0 = as * p0;
    z0 = B0 * s0;
    x1 = x0 + s0;
    g1 = numG(f,x1);
    y0 = g1 - g0;
    B1 = B0 + (y0*y0')/(y0'*s0) - (z0*z0')/(s0'*z0);
    k = k + 1; if k > 100, error('Maxit in BFGS'); end
end
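A useful property of the BFGS update is that the new matrix satisfies the secant condition, that is, it maps the step onto the observed change in the gradient. The sketch below checks this for one update using illustrative vectors s0 and y0, which are assumptions and not values from the example. It also checks that the updated matrix stays positive definite, which holds here because y0'*s0 is positive.

```matlab
% One BFGS update and a check of the secant condition B1*s0 = y0
B0 = eye(2);               % initial Hessian approximation
s0 = [0.5; -0.2];          % illustrative step
y0 = [0.3;  0.1];          % illustrative gradient difference, y0'*s0 = 0.13 > 0
z0 = B0 * s0;
B1 = B0 + (y0*y0')/(y0'*s0) - (z0*z0')/(s0'*z0);
secantErr = norm(B1*s0 - y0);    % zero up to rounding
lambda = eig(B1);                % all positive if B1 is positive definite
fprintf('secant error %.1e, smallest eigenvalue %.3f\n', secantErr, min(lambda));
```

If y0'*s0 were not positive, the update could destroy positive definiteness; practical implementations usually skip or modify the update in that case.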


6. For more details see Nocedal and Wright (2006).

FIGURE 11.10 Comparison of the first seven Newton iterations (circles) with the first five BFGS iterations (triangles).


Fig. 11.10 compares the behavior of the BFGS method and Newton’s method for the previ- 
ously solved problem. We observe that Newton’s method takes, in this case, first a wrong direction 
and then converges very rapidly once it is in an appropriate region. The first step of the BFGS 
method corresponds to a steepest descent move, and then the steps progress more slowly toward 
the solution. 


11.4.4 Direct search methods 


In order to be successful, that is, to ensure rapid and global convergence, gradient-based methods require f to be nicely behaved, a condition that is not always satisfied in practice and that calls for alternative approaches. In particular, we consider problems where the function f to be optimized falls into one of the following categories:


• Derivatives of f(x) are not available, or do not exist.

• Computation of f(x) is very expensive, for instance, because each value is obtained from a large number of simulations.

• Values of f are inherently inexact or noisy, as is the case when they are affected by Monte Carlo variance.

• We are only interested in an improvement of f rather than a fully accurate optimum.


This class of problems is well suited to direct search methods.

Direct search methods were first suggested in the 1950s and developed until the mid-1960s (Hooke and Jeeves (1961), Spendley et al. (1962), Nelder and Mead (1965)) and were considered part of mainstream optimization techniques. By the 1980s, direct search methods had become invisible in the optimization community but remained extremely popular among practitioners, in particular, in chemistry, chemical engineering, medicine, etc. In academia, they regained interest due to the work by Torczon (1989). Current research is undertaken by Torczon (1997), Lagarias et al. (1999), Wright (2000), Frimannslund and Steihaug (2004), etc. Similar algorithms are multidirectional search and pattern search.

The idea of the simplex-based direct search method introduced by Spendley et al. (1962) is as follows: the objective function f is evaluated at the vertices of a simplex in the search space. The simplex evolves toward the minimum by constructing a new simplex obtained by reflecting the original simplex away from the vertex with the largest value of f. Fig. 11.11 gives an intuition on how the starting simplex on the right evolves by reflecting the vertex with the highest function value in the direction of lower values.

FIGURE 11.11 Evolution of starting simplex in the Nelder-Mead algorithm.


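The coordinates X of the simplex for an n-dimensional function are defined as
\[ \underset{n \times (n+1)}{X} = \begin{bmatrix} 0 & d_1 & d_2 & \cdots & d_2 \\ 0 & d_2 & d_1 & \cdots & d_2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & d_2 & d_2 & \cdots & d_1 \end{bmatrix}, \qquad \begin{aligned} d_1 &= s\,\big(\sqrt{n+1} + n - 1\big) / (n \sqrt{2}), \\ d_2 &= s\,\big(\sqrt{n+1} - 1\big) / (n \sqrt{2}), \end{aligned} \]
and s is the length of an edge. In Fig. 11.12, we see the starting simplex with the vertices $x^{(1)}$, $x^{(2)}$, and $x^{(3)}$; its reflection; and a further reflection in the direction of descent.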
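The construction can be checked numerically: with the coordinates above, every pair of vertices should be exactly one edge length s apart. The sketch below builds X for an illustrative dimension n = 3 and edge length s = 2 (both values are assumptions) and computes all pairwise distances.

```matlab
% Regular starting simplex and a check that all edges have length s
n = 3; s = 2;                                     % illustrative dimension and edge length
d1 = s*(sqrt(n+1) + n - 1) / (n*sqrt(2));
d2 = s*(sqrt(n+1) - 1)     / (n*sqrt(2));
X = [zeros(n,1), d2*ones(n) + (d1-d2)*eye(n)];    % n x (n+1), one vertex per column
D = zeros(n+1);                                   % pairwise distances between vertices
for i = 1:n+1
    for j = 1:n+1
        D(i,j) = norm(X(:,i) - X(:,j));
    end
end
offDiag = D(~eye(n+1));                           % all distances between distinct vertices
fprintf('largest deviation from s: %.1e\n', max(abs(offDiag - s)));
```

In practice, the simplex is shifted so that its first vertex coincides with the chosen starting point.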


FIGURE 11.12 Detailed starting simplex and two reflections. 


We now detail the rules according to which the simplex evolves toward the minimum. Consider a function depending on n variables. Then, at iteration k, the simplex is defined by the vertices $x^{(i)}$, $i = 1, \ldots, n+1$ (left panel in the following figure). These vertices are renamed so that $f(x^{(1)}) \le f(x^{(2)}) \le \cdots \le f(x^{(n+1)})$ (right panel). We then compute the mean over all vertices except the worst (the one with the highest function value):
\[ \bar{x}_i = \frac{1}{n} \sum_{j=1}^{n} x_i^{(j)}, \qquad i = 1, \ldots, n. \]

[Figure: function values at the simplex vertices before (left panel) and after (right panel) renaming.]


The vertex $x^{(n+1)}$ with the worst function value is now reflected through the mean $\bar{x}$ of the remaining points. This reflection is computed as
\[ x^{(R)} = (1 + \rho)\, \bar{x} - \rho\, x^{(n+1)}. \]

If the function value at the reflected vertex satisfies $f(x^{(R)}) < f(x^{(1)})$, that is, is better than all remaining points, we expand the reflection further until $x^{(E)}$, which is computed as
\[ x^{(E)} = (1 + \rho)\, x^{(R)} - \rho\, \bar{x}. \]


Note that when computing the expansion, we do not check whether the function goes up 
again. 


If the reflection point satisfies $f(x^{(R)}) < f(x^{(n+1)})$ but $f(x^{(R)}) \ge f(x^{(n)})$, that is, if the function value goes up again but is not the worst in the new simplex, we compute an out-contraction as
\[ x^{(O)} = (1 + \psi \rho)\, \bar{x} - \psi \rho\, x^{(n+1)}. \]


If the function at the reflection point is going up so that $f(x^{(R)}) > f(x^{(n+1)})$, we compute an in-contraction as
\[ x^{(I)} = (1 - \psi \rho)\, \bar{x} + \psi \rho\, x^{(n+1)}. \]


Finally, if outside or inside contraction results in no improvement over $f(x^{(n+1)})$, we shrink the simplex toward its best vertex:
\[ x^{(i)} = x^{(1)} + \sigma\, (x^{(i)} - x^{(1)}), \qquad i = 2, \ldots, n+1. \]

Typical values for the different parameters in the construction mechanism are $\rho = 1$, $\psi = 1/2$, and $\sigma = 1/2$. These parameters are very robust, and one generally does not need to tune them. Algorithm 38 summarizes the Nelder-Mead simplex direct search method. In order to keep the presentation compact, we do not repeat how the new vertices are computed but simply use their notation.


Algorithm 38 Nelder-Mead simplex direct search.

 1: construct vertices $x^{(1)}, \ldots, x^{(n+1)}$ of starting simplex
 2: while stopping criteria not met do
 3:     rename vertices such that $f(x^{(1)}) \le \cdots \le f(x^{(n+1)})$
 4:     if $f(x^{(R)}) < f(x^{(1)})$ then
 5:         if $f(x^{(E)}) < f(x^{(R)})$ then $x^* = x^{(E)}$ else $x^* = x^{(R)}$
 6:     else
 7:         if $f(x^{(R)}) < f(x^{(n)})$ then
 8:             $x^* = x^{(R)}$
 9:         else
10:             if $f(x^{(R)}) < f(x^{(n+1)})$ then
11:                 if $f(x^{(O)}) < f(x^{(n+1)})$ then $x^* = x^{(O)}$ else shrink
12:             else
13:                 if $f(x^{(I)}) < f(x^{(n+1)})$ then $x^* = x^{(I)}$ else shrink
14:             end if
15:         end if
16:     end if
17:     if not shrink then $x^{(n+1)} = x^*$        # Replace worst vertex by $x^*$
18: end while
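The candidate points used in Algorithm 38 are all simple affine combinations of the mean of the n best vertices and the worst vertex. The following sketch computes them for one two-dimensional simplex; the vertex coordinates are illustrative assumptions, and the objective is the function minimized with fminsearch in this section.

```matlab
% Candidate points of one Nelder-Mead iteration for a 2-D simplex
F = @(x) (x(1)^2 + x(2) - 11)^2 + (x(1) + x(2)^2 - 7)^2;
rho = 1; psi = 1/2;                     % typical parameter values
X = [0 1 0; -2 -2 -1];                  % illustrative simplex, one vertex per column
fX = zeros(1,3);
for i = 1:3, fX(i) = F(X(:,i)); end
[fX, idx] = sort(fX); X = X(:,idx);     % rename: best vertex first, worst last
xw   = X(:,3);                          % worst vertex
xbar = mean(X(:,1:2), 2);               % mean of all vertices except the worst
xR = (1 + rho)*xbar - rho*xw;           % reflection
xE = (1 + rho)*xR - rho*xbar;           % expansion
xO = (1 + psi*rho)*xbar - psi*rho*xw;   % out-contraction
xI = (1 - psi*rho)*xbar + psi*rho*xw;   % in-contraction
disp([xR xE xO xI])
```

With these parameter values, the reflection mirrors the worst vertex through the mean, the out-contraction lies halfway between the mean and the reflection, and the in-contraction halfway between the mean and the worst vertex.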


MATLAB function fminsearch implements the Nelder-Mead algorithm. Below we illustrate its use by minimizing the function $f(x) = (x_1^2 + x_2 - 11)^2 + (x_1 + x_2^2 - 7)^2$ and choosing $x = [0 \;\; -2]'$ as the starting point.

F = @(x) (x(1)^2 + x(2)-11)^2 + (x(1) + x(2)^2 - 7)^2;
x0 = [0 -2];
options = optimset('fminsearch');
[x,f,FLAG,output] = fminsearch(F,x0,options)
x =
    3.5844   -1.8481
f =
   2.0449e-08
FLAG =
     1
output =
  struct with fields:

    iterations: 64
     funcCount: 124
     algorithm: 'Nelder-Mead simplex direct search'
       message: 'Optimization terminated: the current x
                 satisfies the termination criteria using OPTIONS.TolX
                 of 1.000000e-04 and F(X) satisfies the convergence
                 criteria using OPTIONS.TolFun of 1.000000e-04'


If F is defined by a function with arguments a1, a2, ..., then we use a function handle, and the call is


fminsearch(@F,x0,options,a1,a2,...)
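An alternative way to pass additional arguments, sketched below, is to capture them in an anonymous function so that fminsearch only sees a function of x. The parameterized objective Fpar and the parameter names a1 and a2 are hypothetical; with a1 = 11 and a2 = 7, the objective coincides with the function minimized above.

```matlab
% Passing extra parameters to the objective through an anonymous function
% (Fpar, a1, and a2 are hypothetical names used for illustration)
Fpar = @(x,a1,a2) (x(1)^2 + x(2) - a1)^2 + (x(1) + x(2)^2 - a2)^2;
a1 = 11; a2 = 7;                          % parameter values fixed before the call
x0 = [0 -2];
[x,fval] = fminsearch(@(x) Fpar(x,a1,a2), x0);
fprintf('x = [%.4f %.4f], f = %.2e\n', x, fval);
```

The values of a1 and a2 are frozen when the anonymous function is created, so changing them afterwards requires creating the handle again.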


11.4.5 Practical issues with MATLAB 
Unconstrained optimization in MATLAB 


In the following, we illustrate how the unconstrained optimization problems of some of the preceding examples can be solved with the MATLAB function fminunc.⁷ We initialize the default options for fminunc and call the function without providing the gradient.


f = @(x) exp(0.1*(x(2)-x(1).^2)^2 + 0.05*(1-x(1))^2);
x0 = [-0.3 0.8];
options = optimset('fminunc');
xs = fminunc(f,x0,options)


Local minimum found. 


Optimization completed because the size of the gradient is 
less than the selected value of the optimality tolerance. 


<stopping criteria details> 


xs = 
1.0000 1.0000 


To provide the gradient, we use the function FnG which has two output arguments, that is, the 
function value and the gradient and we adjust the options correspondingly. 


options = optimoptions('fminunc','SpecifyObjectiveGradient',true);
[xs,fs] = fminunc(@(x) FnG(x,f),x0,options)


7. Refer to MATLAB’s help for a complete description of fminunc.

Local minimum found.


Optimization completed because the size of the gradient is less than the
default value of the optimality tolerance.


<stopping criteria details>


xs =
    1.0000    1.0000
fs =
    1.0000


Listing 11.14: C-BasicMethods/M/./Ch11/FnG.m


function [f,g] = FnG(x,func)
% FnG.m -- version 2010-12-21
f = func(x);
g = numG(func,x);


We observe the flat region around the minimum as the function value is reasonably close to 
its minimum, whereas the function argument is not. To improve the precision of the solution, we 
further modify the options of the algorithm. 


options = optimoptions('fminunc','SpecifyObjectiveGradient',...
    true,'TolFun',1e-14,'MaxFunEvals',300,'TolX',1e-8);
[xs,fs,exflag,output,g] = fminunc(@(x) FnG(x,f),x0,options)


Local minimum found. 


Optimization completed because the size of the gradient is less 
than the selected value of the optimality tolerance. 


<stopping criteria details> 


xs = 
1.0000 1.0000 
fs = 
1.0000 
exflag = 
1 
output =
  struct with fields:

       iterations: 11
        funcCount: 13
         stepsize: 4.7982e-06
     lssteplength: 1
    firstorderopt: 0
        algorithm: 'quasi-newton'
          message: 'Local minimum found. Optimization completed
                    because the size of the gradient is less
                    than the selected value of the optimality
                    tolerance. Stopping criteria details: Optimization
                    completed: The first-order optimality measure,
                    0.000000e+00, is less than
                    options.OptimalityTolerance = 1.000000e-14.
                    Optimization Metric / Options:
                    relative norm(gradient) = 0.00e+00,
                    OptimalityTolerance = 1e-14 (selected)'


Solution for nonlinear systems of equations in MATLAB 


The system specified in Example 11.5 can be solved as follows:


y0 = [1 5]';
ys = fsolve('ex2NLeqs',y0)


ys = 


11.5 Nonlinear Least Squares 
11.5.1 Problem statement and notation 


The optimization problem arising in the case of linear Least Squares has already been introduced in Section 3.4. The optimization of nonlinear Least Squares is a special case of unconstrained optimization. We have the same objective function as in the linear case
\[ g(x) = \tfrac{1}{2}\, r(x)' r(x) = \tfrac{1}{2} \sum_{i=1}^{m} r_i(x)^2, \tag{11.5} \]
but the residuals
\[ r_i(x) = y_i - f(t_i, x), \qquad i = 1, \ldots, m, \]
are nonlinear as “the model” $f(t_i, x)$ is a nonlinear function with $t_i$ being the independent variables and $x \in \mathbb{R}^n$ the vector of parameters to be estimated.

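In order to write the local quadratic approximation (see page 247) for the minimization of Eq. (11.5), we need the first and second derivatives of $g(x)$. The first derivative writes
\[ \nabla g(x) = \sum_{i=1}^{m} r_i(x)\, \nabla r_i(x) = \nabla r(x)'\, r(x), \]
where
\[ \nabla r(x) = \begin{bmatrix} \dfrac{\partial r_1(x)}{\partial x_1} & \cdots & \dfrac{\partial r_1(x)}{\partial x_n} \\ \vdots & & \vdots \\ \dfrac{\partial r_m(x)}{\partial x_1} & \cdots & \dfrac{\partial r_m(x)}{\partial x_n} \end{bmatrix} \]
is the Jacobian matrix. The vector
\[ \nabla r_i(x) = \begin{bmatrix} \dfrac{\partial r_i(x)}{\partial x_1} \\ \vdots \\ \dfrac{\partial r_i(x)}{\partial x_n} \end{bmatrix} \]
corresponds to the ith row of the Jacobian matrix. The second derivative is
\[ \nabla^2 g(x) = \sum_{i=1}^{m} \Big( \nabla r_i(x)\, \nabla r_i(x)' + r_i(x)\, \nabla^2 r_i(x) \Big) = \nabla r(x)'\, \nabla r(x) + S(x), \tag{11.6} \]
with $S(x) = \sum_{i=1}^{m} r_i(x)\, \nabla^2 r_i(x)$.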
We now consider $m_c(x)$, the quadratic approximation of $g(x)$ in a neighborhood of $x_c$,
\[ m_c(x) = g(x_c) + \nabla g(x_c)' (x - x_c) + \tfrac{1}{2}\, (x - x_c)'\, \nabla^2 g(x_c)\, (x - x_c), \]
and search the point $x_+ = x_c + s_N$ satisfying the first-order condition $\nabla m_c(x_+) = 0$ so that $x_+$ is the minimum of $m_c$. We have
\[ \nabla m_c(x_+) = \nabla g(x_c) + \nabla^2 g(x_c)\, \underbrace{(x_+ - x_c)}_{s_N} = 0, \]
where $s_N$ is the Newton step computed by solving the linear system
\[ \nabla^2 g(x_c)\, s_N = -\nabla g(x_c). \]

The minimization of the function defined in Eq. (11.5) could then be done with Newton’s method in which the kth iteration is defined as
\[ \begin{aligned} &\text{Solve } \nabla^2 g(x^{(k)})\, s_N^{(k)} = -\nabla g(x^{(k)}) \\ &x^{(k+1)} = x^{(k)} + s_N^{(k)}. \end{aligned} \]

However, in practice, we proceed differently as the evaluation of $S(x)$ in Eq. (11.6) can be very difficult or even impossible. This difficulty is circumvented by considering only an approximation of the matrix of second derivatives $\nabla^2 g(x)$. The way this approximation is done defines the particular methods presented in the following sections.


11.5.2 Gauss—Newton method 


This method approximates the matrix of second derivatives (Eq. (11.6)) by dropping the term $S(x)$. $S(x)$ is composed of a sum of expressions $r_i(x)\, \nabla^2 r_i(x)$ and, therefore, in situations where the model is close to the observations, we have small residuals leading to a matrix $S(x)$ with relatively small elements. Algorithm 39 summarizes the method.


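Algorithm 39 Gauss-Newton method.

1: initialize $x^{(0)}$
2: for $k = 0, 1, 2, \ldots$ until convergence do
3:     compute $\nabla r(x^{(k)})$
4:     solve $\big(\nabla r(x^{(k)})'\, \nabla r(x^{(k)})\big)\, s_{GN}^{(k)} = -\nabla r(x^{(k)})'\, r(x^{(k)})$
5:     update $x^{(k+1)} = x^{(k)} + s_{GN}^{(k)}$
6: end for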
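The step in Statement 4 can be obtained either from the normal equations or directly from the rectangular system, for which MATLAB's backslash operator returns the least squares solution. The sketch below compares both routes and also shows why the second one is numerically preferable. The Jacobian J and the residual vector r are illustrative numbers, not taken from an example in the text.

```matlab
% Gauss-Newton step computed in two ways (J and r are illustrative)
J = [-1.00  0.00
     -0.18 -0.27
     -0.03 -0.10
     -0.01 -0.03];                  % Jacobian with m = 4 rows and n = 2 columns
r = [0.50; 0.43; 0.25; 0.09];       % residual vector
sNormal = -(J'*J) \ (J'*r);         % Statement 4: normal equations
sLS     = J \ (-r);                 % least squares solution of J*s = -r
condRatio = cond(J'*J) / cond(J)^2; % forming J'*J squares the condition number
fprintf('difference between steps: %.1e, condition ratio: %.6f\n', ...
        norm(sNormal - sLS), condRatio);
```

For a well-conditioned Jacobian such as this one the two steps agree; for nearly rank-deficient Jacobians the squared condition number of the normal equations can cost a substantial number of digits.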


Note that the linear system defining the Gauss-Newton step $s_{GN}^{(k)}$ is a system of normal equations and, therefore, at each iteration, we solve a linear Least Squares problem. According to our earlier suggestion concerning the linear Least Squares problem, we generally prefer to consider this an overidentified system
\[ \nabla r(x^{(k)})\, s_{GN}^{(k)} \approx -r(x^{(k)}), \]
which is then solved via QR factorization. Of course, convergence is not guaranteed, in particular, for large residuals.

11.5.3 Levenberg-Marquardt method


If the Gauss-Newton method does not converge, or the Jacobian matrix does not have full rank, the Levenberg-Marquardt method suggests approximating the matrix $S(x)$ by a diagonal matrix $\mu I$. This leads to Algorithm 40.


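Algorithm 40 Levenberg-Marquardt method.

1: initialize $x^{(0)}$
2: for $k = 0, 1, 2, \ldots$ until convergence do
3:     compute $\nabla r(x^{(k)})$ and $\mu_k$
4:     solve $\big(\nabla r(x^{(k)})'\, \nabla r(x^{(k)}) + \mu_k I\big)\, s_{LM}^{(k)} = -\nabla r(x^{(k)})'\, r(x^{(k)})$
5:     update $x^{(k+1)} = x^{(k)} + s_{LM}^{(k)}$
6: end for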


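The step $s_{LM}^{(k)}$ is the solution of a linear Least Squares problem and is computed by solving the overidentified system
\[ \begin{bmatrix} \nabla r(x^{(k)}) \\ \mu_k^{1/2}\, I \end{bmatrix} s_{LM}^{(k)} \approx - \begin{bmatrix} r(x^{(k)}) \\ 0 \end{bmatrix}, \]
where we resort to QR factorization, avoiding the product $\nabla r(x^{(k)})'\, \nabla r(x^{(k)})$.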

In Statement 3, the parameter $\mu$ is adjusted at every iteration. However, in practice, a constant $\mu = 10^{-2}$ appears to be a good choice. For $\mu = 0$, we have the special case of the Gauss-Newton method. With an appropriate choice of $\mu$, the Levenberg-Marquardt method appears very robust in practice and is, therefore, the method of choice in most of the specialized software. Again, as always in nonlinear problems, convergence is not guaranteed and depends on the starting point chosen.
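The equivalence between Statement 4 of Algorithm 40 and the augmented overidentified system can be verified numerically. In the sketch below, the Jacobian J and the residual vector r are illustrative numbers (assumptions, not data from the text), and mu is set to the constant value suggested above.

```matlab
% Levenberg-Marquardt step: augmented least squares system versus Statement 4
J = [-1.00  0.00
     -0.18 -0.27
     -0.03 -0.10
     -0.01 -0.03];                     % illustrative Jacobian
r = [0.50; 0.43; 0.25; 0.09];          % illustrative residuals
mu = 1e-2; n = size(J,2);
sAug = [J; sqrt(mu)*eye(n)] \ (-[r; zeros(n,1)]);   % augmented system, no J'*J formed
sReg = -(J'*J + mu*eye(n)) \ (J'*r);                % regularized normal equations
sGN  = J \ (-r);                                    % Gauss-Newton step (mu = 0)
fprintf('|sAug - sReg| = %.1e, |sLM| = %.4f, |sGN| = %.4f\n', ...
        norm(sAug - sReg), norm(sAug), norm(sGN));
```

The Levenberg-Marquardt step is never longer than the Gauss-Newton step, and it shrinks as mu grows, which is the source of the method's robustness.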


Example 11.3 


We introduce a small example to illustrate the methods just presented. Consider the data

    t    0.0   1.0   2.0   3.0
    y    2.0   0.7   0.3   0.1

where t and y are the independent and dependent variables, respectively. The variable y is assumed to be explained by the model $f(t, x) = x_1 e^{x_2 t}$ where $x_1$, $x_2$ are the parameters to be estimated. The vector of residuals is then
\[ r(x) = \begin{bmatrix} y_1 - x_1 e^{x_2 t_1} \\ y_2 - x_1 e^{x_2 t_2} \\ y_3 - x_1 e^{x_2 t_3} \\ y_4 - x_1 e^{x_2 t_4} \end{bmatrix}. \]
Fig. 11.13 plots the observations and the model for parameter values $x_1 = 2.5$ and $x_2 = 2.5$. The right panel shows the shape of the objective function $g(x)$ which is the sum of squared residuals to be minimized. The function is only locally convex.
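FIGURE 11.13 Left panel: Plot of observations, model for $x_1 = 2.5$, $x_2 = 2.5$ and residuals. Right panel: plot of sum of squared residuals $g(x)$.

The Jacobian matrix writes
\[ \nabla r(x) = \begin{bmatrix} \dfrac{\partial r_1(x)}{\partial x_1} & \dfrac{\partial r_1(x)}{\partial x_2} \\ \dfrac{\partial r_2(x)}{\partial x_1} & \dfrac{\partial r_2(x)}{\partial x_2} \\ \dfrac{\partial r_3(x)}{\partial x_1} & \dfrac{\partial r_3(x)}{\partial x_2} \\ \dfrac{\partial r_4(x)}{\partial x_1} & \dfrac{\partial r_4(x)}{\partial x_2} \end{bmatrix} = \begin{bmatrix} -e^{x_2 t_1} & -x_1 t_1 e^{x_2 t_1} \\ -e^{x_2 t_2} & -x_1 t_2 e^{x_2 t_2} \\ -e^{x_2 t_3} & -x_1 t_3 e^{x_2 t_3} \\ -e^{x_2 t_4} & -x_1 t_4 e^{x_2 t_4} \end{bmatrix} \]
and the first derivatives are
\[ \nabla g(x) = \begin{bmatrix} \dfrac{\partial g(x)}{\partial x_1} \\ \dfrac{\partial g(x)}{\partial x_2} \end{bmatrix} = \nabla r(x)'\, r(x) = \begin{bmatrix} -\sum_{i=1}^{4} r_i(x)\, e^{x_2 t_i} \\ -\sum_{i=1}^{4} r_i(x)\, x_1 t_i e^{x_2 t_i} \end{bmatrix}. \]

The m matrices forming $S(x)$ are
\[ \nabla^2 r_i(x) = \begin{bmatrix} \dfrac{\partial^2 r_i(x)}{\partial x_1^2} & \dfrac{\partial^2 r_i(x)}{\partial x_1 \partial x_2} \\ \dfrac{\partial^2 r_i(x)}{\partial x_1 \partial x_2} & \dfrac{\partial^2 r_i(x)}{\partial x_2^2} \end{bmatrix} = \begin{bmatrix} 0 & -t_i e^{x_2 t_i} \\ -t_i e^{x_2 t_i} & -x_1 t_i^2 e^{x_2 t_i} \end{bmatrix} \]
and the Hessian matrix of second derivatives of $g(x)$ is
\[ \nabla^2 g(x) = \begin{bmatrix} \sum_{i=1}^{4} (e^{x_2 t_i})^2 & \sum_{i=1}^{4} x_1 t_i (e^{x_2 t_i})^2 \\ \sum_{i=1}^{4} x_1 t_i (e^{x_2 t_i})^2 & \sum_{i=1}^{4} (x_1 t_i e^{x_2 t_i})^2 \end{bmatrix} - \sum_{i=1}^{4} r_i(x) \begin{bmatrix} 0 & t_i e^{x_2 t_i} \\ t_i e^{x_2 t_i} & x_1 t_i^2 e^{x_2 t_i} \end{bmatrix}. \]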


With the MATLAB code given below, we compute for the particular starting point $x_c = [1.5 \;\; -1.7]'$ one step for the Newton, Gauss-Newton, and Levenberg-Marquardt methods.


y = [2.0 0.7 0.3 0.1]';
t = [0 1.0 2.0 3.0]';
xc = [1.5 -1.7]';  % Starting value
m = 4; n = 2;      % number of observations and parameters
v = exp( xc(2)*t );
r = y - xc(1)*v;
J = [-v -xc(1)*t.*v];
nablagxc = J'*r; P = J'*J;
S = zeros(n,n);
for i = 1:m
    S = S + r(i)*v(i) * [ 0 -t(i); -t(i) -xc(1)*t(i)^2];
end
H = P + S; % Exact Hessian
% -- Newton step
sN = - H \ nablagxc;
xnN = xc + sN;
% -- Gauss-Newton step
H = P;
sGN = - H \ nablagxc;
xnGN = xc + sGN;
% -- Levenberg-Marquardt step
mu = 2; M = sqrt(mu)*eye(n);
sLM = [J; M] \ [r' zeros(1,n)]';
xnLM = xc - sLM;
%
fprintf('\n Newton step:              [%6.3f %6.3f]',xnN);
fprintf('\n Gauss-Newton step:        [%6.3f %6.3f]',xnGN);
fprintf('\n Levenberg-Marquardt step: [%6.3f %6.3f]\n',xnLM);

 Newton step:              [ 1.990 -3.568]
 Gauss-Newton step:        [ 1.996 -0.330]
 Levenberg-Marquardt step: [ 1.692 -1.636]


FIGURE 11.14 Starting point (bullet), one step (circle) and contour plots of local models for Newton (upper right), Gauss-Newton (lower left) and Levenberg-Marquardt (lower right).

Fig. 11.14 shows the starting point (bullet) and the step (circle) for the three methods. We observe in the upper right panel that the local model for the Newton step is not convex, so that its stationary point is a saddle point rather than a minimum. In the lower left panel, we have the results for the Gauss-Newton method. The local model is convex, but its minimum falls into a region outside of the domain of definition of g(x). Finally, in the lower right panel, we have the local model for the Levenberg-Marquardt method, which is convex and has its minimum in a position that is appropriate to progress toward the minimum of g(x).

For the particular step given in the illustration, Levenberg-Marquardt is the only method that works. This, of course, is not always the case. We wanted to highlight that using the exact Hessian matrix with Newton's method does not at all guarantee an efficient step.
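One way to make the comparison of the three steps concrete is to evaluate the objective function of Eq. (11.5) at the starting point and at the three new points. The sketch below recomputes the steps in a compact form; the handle g and the variable names are chosen for this illustration only, and the Levenberg-Marquardt step is written with the regularized normal equations, which yield the same step as the augmented system used in the example.

```matlab
% Objective value before and after one Newton, Gauss-Newton, and LM step
y = [2.0 0.7 0.3 0.1]'; t = [0 1.0 2.0 3.0]';
g = @(x) 0.5 * sum((y - x(1)*exp(x(2)*t)).^2);   % objective for the model x1*exp(x2*t)
xc = [1.5 -1.7]';                                % starting point of the example
v = exp(xc(2)*t); r = y - xc(1)*v;
J = [-v, -xc(1)*t.*v];                           % Jacobian of the residuals
grad = J'*r; P = J'*J;
S = zeros(2);
for i = 1:4
    S = S + r(i)*v(i) * [0 -t(i); -t(i) -xc(1)*t(i)^2];
end
xnN  = xc - (P + S) \ grad;                      % Newton (exact Hessian)
xnGN = xc - P \ grad;                            % Gauss-Newton (S dropped)
xnLM = xc - (P + 2*eye(2)) \ grad;               % Levenberg-Marquardt with mu = 2
fprintf('g(xc) = %.4f  Newton: %.4f  Gauss-Newton: %.4f  LM: %.4f\n', ...
        g(xc), g(xnN), g(xnGN), g(xnLM));
```

For this starting point, only the Levenberg-Marquardt step lowers the objective, which agrees with the discussion of Fig. 11.14.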


11.6 Solving systems of nonlinear equations F(x) = 0 


11.6.1 General considerations 


Contrary to the case with linear systems, the solution of nonlinear systems is a delicate matter. Generally, there is no unique solution and no guarantee of convergence to a solution, and the computational complexity grows very fast with the size of the system. To illustrate the problem, we consider a system of two equations
\[ \begin{aligned} y_1 - y_2^3 - \tfrac{1}{2} &= 0 \\ 4 y_1^2 - y_2 + c &= 0. \end{aligned} \]

FIGURE 11.15 Solutions for varying values of parameter c of the systems of two nonlinear equations.


Depending on the value of the parameter c, there are between 0 and 4 solutions in the domain considered. The different situations are illustrated in Fig. 11.15.
A system of n nonlinear equations is generally specified in its implicit form using the notation
\[ F(y) = 0 \quad \equiv \quad \begin{cases} f_1(y) = 0 \\ \quad \vdots \\ f_n(y) = 0 \end{cases} \tag{11.7} \]
where at least one of the equations $f_i$, $i = 1, 2, \ldots, n$, must be nonlinear. The equations of such a system, in particular if it is large, may often be rearranged such that the Jacobian matrix
\[ \nabla F(y) = \begin{bmatrix} \dfrac{\partial f_1}{\partial y_1} & \cdots & \dfrac{\partial f_1}{\partial y_n} \\ \vdots & & \vdots \\ \dfrac{\partial f_n}{\partial y_1} & \cdots & \dfrac{\partial f_n}{\partial y_n} \end{bmatrix} \tag{11.8} \]
can be put into a block triangular form, that is, showing a pattern like

[Figure: block triangular pattern of the Jacobian matrix.]

with the matrices on the diagonal being indecomposable. In such a situation, the solution of the system of equations consists in solving a sequence of recursive and interdependent systems.⁸ In the following we suppose the system to be interdependent, that is, indecomposable.

In economics, we often find a different notation:
\[ h(y, z) = 0, \]
where $y \in \mathbb{R}^n$ are the endogenous variables and $z \in \mathbb{R}^m$ the exogenous variables. In the previous notation, the exogenous variables are not explicit but incorporated into the function F. The Jacobian matrix for this notation is $\partial h / \partial y'$.

8. Refer to page 49 for how to compute the block triangular decomposition. If the decomposition is an upper block triangular matrix, the sequence of recursive and interdependent systems is solved from bottom to top.

A variety of methods may be considered to solve systems of nonlinear equations. The following 
sections discuss fixed point methods, Newton’s methods, and minimization. 


11.6.2 Fixed point methods 


Jacobi, Gauss-Seidel, and SOR methods 


The Jacobi, Gauss-Seidel, and SOR methods presented in Section 3.2 can be extended for the solution of nonlinear systems of equations (Algorithm 41). In general, the equations are normalized, that is, they are written in the form
\[ y_i = g_i(y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n, z), \qquad i = 1, \ldots, n, \]
where the equations $g_i$ can now also be nonlinear.

For the Jacobi method, the generic iteration writes:
\[ y_i^{(k+1)} = g_i\big(y_1^{(k)}, \ldots, y_{i-1}^{(k)}, y_{i+1}^{(k)}, \ldots, y_n^{(k)}, z\big), \qquad i = 1, \ldots, n. \]

With the Gauss-Seidel method, the generic iteration i uses the $i - 1$ updated components of $y^{(k+1)}$ as soon as they are available:
\[ y_i^{(k+1)} = g_i\big(y_1^{(k+1)}, \ldots, y_{i-1}^{(k+1)}, y_{i+1}^{(k)}, \ldots, y_n^{(k)}, z\big), \qquad i = 1, \ldots, n. \]

The stopping criterion is identical to the one used in the linear case, that is, iterations are stopped if the following condition is satisfied
\[ \frac{\big| y_i^{(k+1)} - y_i^{(k)} \big|}{\big| y_i^{(k)} \big| + 1} < \varepsilon, \qquad i = 1, 2, \ldots, n, \]
where $\varepsilon$ is a given tolerance.

As explained earlier for the linear case, convergence depends on the spectral radius of the matrix $M^{-1} N$ defined on page 41. In the nonlinear case, the matrices M and N change from iteration to iteration and we cannot predict their values at the solution of the system.

The Jacobi method is part of the class of fixed point methods, formalized as
\[ y = g(y), \]
where g is the iteration function corresponding to a particular normalization, and the fixed point iteration is written
\[ y^{(k+1)} = g(y^{(k)}), \qquad k = 0, 1, 2, \ldots, \]
with $y^{(0)}$ being the starting solution. The condition for convergence is
\[ \rho\big(\nabla g(y^{sol})\big) < 1, \qquad \nabla g(y^{sol}) = \begin{bmatrix} \dfrac{\partial g_1}{\partial y_1}\Big|_{y_1 = y_1^{sol}} & \cdots & \dfrac{\partial g_1}{\partial y_n}\Big|_{y_n = y_n^{sol}} \\ \vdots & & \vdots \\ \dfrac{\partial g_n}{\partial y_1}\Big|_{y_1 = y_1^{sol}} & \cdots & \dfrac{\partial g_n}{\partial y_n}\Big|_{y_n = y_n^{sol}} \end{bmatrix}, \]
which, as mentioned earlier, is useless in practice as the Jacobian matrix has to be evaluated at the solution $y^{sol}$.


Algorithm 41 Jacobi, Gauss-Seidel, and SOR for nonlinear systems.

1: initialize $y^{(1)}$, $y^{(0)}$, $\varepsilon$ and maximum number of iterations
2: while ~converged($y^{(0)}$, $y^{(1)}$, $\varepsilon$) do
3:     $y^{(0)} = y^{(1)}$        # Store previous iteration in $y^{(0)}$
4:     compute $y^{(1)}$ with Jacobi, Gauss-Seidel, or SOR
5:     check number of iterations
6: end while

Example 11.4 


We illustrate the fixed point method by solving the following system of nonlinear equations representing a straight line and a circle:
\[ F(y) = 0 \quad \Leftrightarrow \quad \begin{cases} f_1(y_1, y_2): \; y_1 + y_2 - 3 = 0 \\ f_2(y_1, y_2): \; y_1^2 + y_2^2 - 9 = 0 \end{cases} \]
and verifying the two solutions $y = [0 \;\; 3]'$ and $y = [3 \;\; 0]'$. The MATLAB code below solves the system with the fixed point method. The iteration functions, that is, the normalized equations, are stored in matrix G in the form of character strings. For larger systems of equations, it would be preferable to evaluate the iteration functions in Statements 7 and 8 in a loop. The sequence of solutions has been saved by means of Statements 4 and 10. These statements are otherwise not necessary.


Listing 11.15: C-BasicMethods/M/./Ch11/FPInD.m


% FPInD.m -- version 2010-12-23
G = str2mat('-y0(2) + 3','sqrt(-y0(1)^2 + 9)');
y1 = [2 0]'; y0 = -y1; tol = 1e-2; k = 1; itmax = 10;
Y = NaN(itmax,2); Y(1,:) = y1';
while ~converged(y0,y1,tol)
    y0 = y1;
    y1(1) = eval(G(1,:));
    y1(2) = eval(G(2,:));
    k = k + 1;
    Y(k,:) = y1';
    if k > itmax, error('Iteration limit reached'); end
end
y1


Fig. 11.16 shows how the algorithm evolves during the iterations for the starting solution y0 = [1 5]'. If we consider a different starting solution y0 = [2 0]', the solutions oscillate between y = [0 0]' and y = [3 3]' without converging. This is illustrated in Fig. 11.17.
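The behavior of a fixed point iteration also depends on how the components are updated. The sketch below applies a Gauss-Seidel type update to the same normalized equations, starting from y = [2 0]', so that the second equation already uses the freshly updated first component. The loop and the inline stopping rule are assumptions made for this illustration and replace the function converged used in the listings.

```matlab
% Gauss-Seidel type fixed point iteration for the line-and-circle system
y = [2; 0];                              % starting solution that makes Jacobi oscillate
tol = 1e-2;
for k = 1:10
    yold = y;
    y(1) = -y(2) + 3;                    % first normalized equation, uses current y(2)
    y(2) = sqrt(-y(1)^2 + 9);            % second equation, uses the updated y(1)
    if max(abs(y - yold) ./ (abs(yold) + 1)) < tol, break, end
end
disp(y')
fprintf('stopped after %d iterations\n', k);
```

In this particular case, the first update already lands on the solution [3 0]', and the second pass only confirms it. This should not be read as a general statement about the superiority of either update rule.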


11.6.3 Newton’s method 


Newton’s method for the solution of a system of nonlinear equations is a generalization of Newton’s method for finding the roots of a one-dimensional function presented in Section 11.1.5. The fact that an algorithm for a one-dimensional problem can be efficiently generalized to an n-dimensional problem is an exceptional situation in numerical computation. Newton’s method is also often termed Newton-Raphson method.


    Iter        Solution                  Error
     k     y1(k)      y2(k)      |y1(k) - y1sol|   |y2(k) - y2sol|
     0     1.000      5.000          1.000             2.000
     1    -2.000      2.828          2.000             0.172
     2     0.172      2.236          0.172             0.764
     3     0.764      2.995          0.764             0.005
     4     0.005      2.901          0.005             0.099
     5     0.099      3.000          0.099             0.000
     6     0.000      2.998          0.000             0.002
     7     0.002      3.000          0.002             0.000

FIGURE 11.16 Steps of the fixed point algorithm for the starting solution y0 = [1 5]'.


    Iter        Solution                  Error
     k     y1(k)      y2(k)      |y1(k) - y1sol|   |y2(k) - y2sol|
     0     2.000      0.000          2.000             3.000
     1     3.000      2.236          3.000             0.764
     2     0.764      0.000          0.764             3.000
     3     3.000      2.901          3.000             0.099
     4     0.099      0.000          0.099             3.000
     5     3.000      2.998          3.000             0.002
     6     0.002      0.000          0.002             3.000
     7     3.000      3.000          3.000             0.000
     8     0.000      0.000          0.000             3.000
     9     3.000      3.000          3.000             0.000
    10     0.000      0.000          0.000             3.000

FIGURE 11.17 Oscillatory behavior of the fixed point algorithm for the starting solution y0 = [2 0]'.

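The solution $y^*$ of a system of nonlinear equations defined in (11.7) is approximated by the sequence $\{y^{(k)}\}_{k = 0, 1, 2, \ldots}$. Given $y^{(k)} \in \mathbb{R}^n$ and an evaluation of the Jacobian matrix
\[ \nabla F(y^{(k)}) = \begin{bmatrix} \dfrac{\partial f_1}{\partial y_1}\Big|_{y_1 = y_1^{(k)}} & \cdots & \dfrac{\partial f_1}{\partial y_n}\Big|_{y_n = y_n^{(k)}} \\ \vdots & & \vdots \\ \dfrac{\partial f_n}{\partial y_1}\Big|_{y_1 = y_1^{(k)}} & \cdots & \dfrac{\partial f_n}{\partial y_n}\Big|_{y_n = y_n^{(k)}} \end{bmatrix}, \]
we construct an improved $y^{(k+1)}$ by approximating $F(y)$ in the neighborhood of $y^{(k)}$ with the “local model”
\[ F(y) \approx \underbrace{F(y^{(k)}) + \nabla F(y^{(k)})\, (y - y^{(k)})}_{\text{local model}}. \]
This local model is solved for y to satisfy
\[ F(y^{(k)}) + \nabla F(y^{(k)})\, (y - y^{(k)}) = 0, \]
which then results in
\[ y = y^{(k)} - \big(\nabla F(y^{(k)})\big)^{-1} F(y^{(k)}). \]
The procedure is repeated for the new value of y. At iteration k, we have
\[ y^{(k+1)} = y^{(k)} - \big(\nabla F(y^{(k)})\big)^{-1} F(y^{(k)}), \]
and $y^{(k+1)}$ is the solution of the linear system
\[ \underbrace{\nabla F(y^{(k)})}_{J}\; \underbrace{(y^{(k+1)} - y^{(k)})}_{s} = \underbrace{-F(y^{(k)})}_{b}. \]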


Algorithm 42 summarizes Newton’s method for the solution of a system of nonlinear equations. 


Algorithm 42 Newton’s method for nonlinear systems.

1: initialize $y^{(0)}$
2: for $k = 0, 1, 2, \ldots$ until convergence do
3:     compute $b = -F(y^{(k)})$ and $J = \nabla F(y^{(k)})$
4:     verify condition of J
5:     solve $J s = b$
6:     $y^{(k+1)} = y^{(k)} + s$
7: end for


Example 11.5 


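We illustrate Newton’s method by solving the system of equations presented in Example 11.4. The Jacobian matrix is
\[ \nabla F(y^{(k)}) = \begin{bmatrix} \dfrac{\partial f_1}{\partial y_1}\Big|_{y_1 = y_1^{(k)}} & \dfrac{\partial f_1}{\partial y_2}\Big|_{y_2 = y_2^{(k)}} \\ \dfrac{\partial f_2}{\partial y_1}\Big|_{y_1 = y_1^{(k)}} & \dfrac{\partial f_2}{\partial y_2}\Big|_{y_2 = y_2^{(k)}} \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 2 y_1^{(k)} & 2 y_2^{(k)} \end{bmatrix}. \]

Choosing $y^{(0)} = [1 \;\; 5]'$ for the starting solution, we have
\[ F(y^{(0)}) = \begin{bmatrix} 3 \\ 17 \end{bmatrix} \quad \text{and} \quad \nabla F(y^{(0)}) = \begin{bmatrix} 1 & 1 \\ 2 & 10 \end{bmatrix}. \]
Solving the linear system
\[ \begin{bmatrix} 1 & 1 \\ 2 & 10 \end{bmatrix} s^{(0)} = \begin{bmatrix} -3 \\ -17 \end{bmatrix}, \]
we get $s^{(0)} = [-13/8 \;\; -11/8]'$ from where we compute
\[ y^{(1)} = y^{(0)} + s^{(0)} = \begin{bmatrix} -5/8 \\ 29/8 \end{bmatrix} \]
and update the function and the Jacobian
\[ F(y^{(1)}) = \begin{bmatrix} 0 \\ 145/32 \end{bmatrix}, \qquad \nabla F(y^{(1)}) = \begin{bmatrix} 1 & 1 \\ -5/4 & 29/4 \end{bmatrix}. \]
We again consider the linear system
\[ \begin{bmatrix} 1 & 1 \\ -5/4 & 29/4 \end{bmatrix} s^{(1)} = \begin{bmatrix} 0 \\ -145/32 \end{bmatrix}, \]
the solution of which is $s^{(1)} = [145/272 \;\; -145/272]'$, and we get the solution for the second iteration
\[ y^{(2)} = y^{(1)} + s^{(1)} = \begin{bmatrix} -25/272 \\ 841/272 \end{bmatrix}. \]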

The following MATLAB code represents a very simple way to solve this particular problem. 


Feq = str2mat('y0(1) + y0(2) - 3','y0(1)^2 + y0(2)^2 - 9');
y1 = [1 5]'; y0 = -y1; tol = 1e-2; k = 1; itmax = 10;
Y = NaN(itmax,2); Y(1,:) = y1';
while ~converged(y0,y1,tol)
    y0 = y1;
    F(1) = eval(Feq(1,:));
    F(2) = eval(Feq(2,:));
    b = -F'; J = [1 1; 2*y0(1) 2*y0(2)];
    s = J \ b;
    y1 = y0 + s;
    k = k + 1;
    Y(k,:) = y1';
    if k > itmax, error('Iteration limit reached'); end
end
y1
y1 =
   -0.0000
    3.0000


Fig. 11.18 shows the steps of the algorithm for the starting solution $y^{(0)} = [1\ \ 5]'$, and we notice
the faster convergence. Indeed, the convergence rate of Newton's method is quadratic, as we verify
by comparing $|y_i^{(k+1)} - y_i^{\text{sol}}|$ with $|y_i^{(k)} - y_i^{\text{sol}}|^2$ in the table below.

Iter    Solution                      Error
 k      $y_1^{(k)}$   $y_2^{(k)}$     $|y_1^{(k)} - y_1^{\text{sol}}|$   $|y_2^{(k)} - y_2^{\text{sol}}|$
 0       1.000         5.000           1.000                              2.000
 1      -0.625         3.625           0.625                              0.625
 2      -0.092         3.092           0.092                              0.092
 3      -0.003         3.003           0.003                              0.003

FIGURE 11.18 Steps of the Newton algorithm for starting solution $y^{(0)} = [1\ \ 5]'$.
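The quadratic rate can also be checked numerically. The following is a minimal sketch, not part of the book's code: it reruns Newton's method on the system of Example 11.4 with the analytic Jacobian and prints the ratio of each error to the square of the previous one. The variable names (`ysol`, `ratio`) and the use of the maximum norm are assumptions made for illustration.

```matlab
% System of Example 11.4 with its analytic Jacobian
F = @(y) [y(1) + y(2) - 3; y(1)^2 + y(2)^2 - 9];
J = @(y) [1 1; 2*y(1) 2*y(2)];
ysol = [0; 3];                 % solution approached from y0 = [1 5]'
y    = [1; 5];
err  = norm(y - ysol, inf);    % largest componentwise error
ratio = zeros(4,1);
for k = 1:4
    s = J(y) \ (-F(y));        % Newton step
    y = y + s;
    errNew   = norm(y - ysol, inf);
    ratio(k) = errNew / err^2; % stays bounded under quadratic convergence
    fprintf('%d  %10.3e  %6.3f\n', k, errNew, ratio(k));
    err = errNew;
end
```

If the ratios stay bounded while the error shrinks, the error is roughly squared at every step, which is what quadratic convergence means.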


A more general code is given with the MATLAB function NewtonnD.m where the system to be
solved and the starting point are the input arguments. The Jacobian is computed numerically with the
function numJ. F(y) of the system to be solved is coded with the function ex2NLeqs.m.


Listing 11.16: C-BasicMethods/M/./Ch11/NewtonnD.m

function x1 = NewtonnD(f,x1,tol,itmax)
% NewtonnD.m -- version 2010-09-12
if nargin < 3, tol = 1e-2; itmax = 10; end
x1 = x1(:); x0 = -x1; k = 0;
while ~converged(x0,x1,tol)
    x0 = x1;
    F = feval(f,x0);
    b = -F; J = numJ(f,x0);
    s = J \ b;
    x1 = x0 + s;
    k = k + 1;
    if k > itmax, error('Maxit in NewtonnD'); end
end

Listing 11.17: C-BasicMethods/M/./Ch11/ex2NLeqs.m

function F = ex2NLeqs(x)
% ex2NLeqs.m -- version 2010-12-23
F = zeros(numel(x),1);
F(1) = x(1) + x(2) - 3;
F(2) = x(1)^2 + x(2)^2 - 9;

Listing 11.18: C-BasicMethods/M/./Ch11/numJ.m

function J = numJ(f,x)
% numJ.m -- version 2010-09-16
n = numel(x); J = zeros(n,n); x = x(:);
Delta = diag(max(sqrt(eps) * abs(x),sqrt(eps)));
F = feval(f,x);
for j = 1:n
    Fd = feval(f,(x + Delta(:,j)));
    J(:,j) = (Fd - F) / Delta(j,j);
end

y0 = [1 5];
ys = NewtonnD('ex2NLeqs',y0)


MATLAB provides the function fsolve to find the solution of a nonlinear system (see Section 11.4.5).


Convergence 


Under certain conditions, Newton's method converges quadratically. In practice, these conditions
cannot be checked in advance. We have

$$\| y^{(k+1)} - y^* \| \le \beta\gamma\, \| y^{(k)} - y^* \|^2, \qquad k = 0, 1, 2, \ldots,$$

where $\beta$ measures the relative nonlinearity, $\| \nabla F(y^*)^{-1} \| \le \beta < \infty$, and $\gamma$ is the Lipschitz constant of $\nabla F$.
Moreover, convergence is only guaranteed if the starting point $y^{(0)}$ lies in a neighborhood of the
solution $y^*$, where the neighborhood has to be defined. In the case of macroeconomic models, the
starting point $y^{(0)}$ is naturally defined by the solution of the preceding period, which in general
is a good starting point.


Computational complexity 


The computational complexity of Newton’s method is determined by the solution of the linear sys- 
tem. If we compare the fixed point methods with Newton’s method, we observe that the difference 
in the amount of computations comes from the evaluation of the Jacobian matrix and the solution 
of the linear system. This might appear to be a disadvantage for Newton’s method. However, in 
practice, we observe that if we use a sparse direct method, the complexity of the two methods is 
comparable as the number of iterations for Newton’s method is significantly below the one for the 
iterative methods. 
11.6.4 Quasi-Newton methods


Newton's method necessitates at every iteration the evaluation of the Jacobian matrix, which in
the case of a dense system requires the computation of $n^2$ derivatives, and the solution of a linear
system, which is $O(n^3)$. In order to avoid the computation of the Jacobian matrix at every
iteration, one might think to replace these computations by an inexpensive update of the Jacobian
matrix. This gives rise to a variety of variants for Newton's method called quasi-Newton methods.


Broyden’s method 

This method updates the Jacobian matrix by means of rank one matrices. Given an approximation
$B^{(k)}$ of the Jacobian matrix at iteration $k$, the approximation for iteration $k + 1$ is computed
as

$$B^{(k+1)} = B^{(k)} + \frac{\big(dF^{(k)} - B^{(k)} s^{(k)}\big)\, s^{(k)\prime}}{s^{(k)\prime} s^{(k)}},$$

where $dF^{(k)} = F(y^{(k+1)}) - F(y^{(k)})$ and $s^{(k)}$ is the solution to $B^{(k)} s^{(k)} = -F(y^{(k)})$. Broyden's
algorithm (Algorithm 43) is formalized hereafter.


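Algorithm 43 Broyden's method for nonlinear systems.

1: initialize $y^{(0)}$ and $B^{(0)}$ (an approximation for $\nabla F(y^{(0)})$)
2: for $k = 0, 1, 2, \ldots$ until convergence do
3:     solve $B^{(k)} s^{(k)} = -F(y^{(k)})$
4:     $y^{(k+1)} = y^{(k)} + s^{(k)}$
5:     $dF^{(k)} = F(y^{(k+1)}) - F(y^{(k)})$
6:     $B^{(k+1)} = B^{(k)} + \big(dF^{(k)} - B^{(k)} s^{(k)}\big)\, s^{(k)\prime} / \big(s^{(k)\prime} s^{(k)}\big)$
7: end for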
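As a quick check (a standard property of the rank-one update, spelled out here rather than taken from the text), multiplying the updated matrix by the step shows that the new approximation reproduces the observed change in $F$ along the step just taken, the so-called secant condition:

$$B^{(k+1)} s^{(k)} = B^{(k)} s^{(k)} + \frac{\big(dF^{(k)} - B^{(k)} s^{(k)}\big)\, s^{(k)\prime} s^{(k)}}{s^{(k)\prime} s^{(k)}} = B^{(k)} s^{(k)} + dF^{(k)} - B^{(k)} s^{(k)} = dF^{(k)}.$$

In directions orthogonal to $s^{(k)}$ the update leaves $B^{(k)}$ unchanged, which is why it is inexpensive compared with recomputing the Jacobian.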


Example 11.6 


Again we consider the system of Example 11.4 but use Broyden's method to solve it. First, we start with
an identity matrix as approximation for the Jacobian matrix and the algorithm converges in 7 iterations
from $y^{(0)} = [5/2\ \ 1]'$ to the solution $y^* = [0\ \ 3]'$ as can be seen in Fig. 11.19.

Next, we take the Jacobian matrix evaluated at the starting point for our initial $B^{(0)}$ matrix. In this
case, the algorithm converges faster, that is, in 5 iterations, but to a different solution which is $y^* = [3\ \ 0]'$.
This is illustrated in Fig. 11.20.

Iter    Solution                      Error
 k      $y_1^{(k)}$   $y_2^{(k)}$     $|y_1^{(k)} - y_1^{\text{sol}}|$   $|y_2^{(k)} - y_2^{\text{sol}}|$
 0       2.500         1.000           0.500                              1.000
 1       2.000         2.750           1.000                              2.750
 2       1.163         1.524           1.837                              1.524
 3      -0.361         2.995           3.361                              2.995
 4       0.199         2.808           2.801                              2.808
 5       0.017         2.986           2.983                              2.986
 6      -0.001         3.001           3.001                              3.001
 7       0.000         3.000           3.000                              3.000

FIGURE 11.19 Steps of the Broyden algorithm for starting point $y^{(0)} = [5/2\ \ 1]'$ and identity matrix for $B^{(0)}$.

Iter    Solution                      Error
 k      $y_1^{(k)}$   $y_2^{(k)}$     $|y_1^{(k)} - y_1^{\text{sol}}|$   $|y_2^{(k)} - y_2^{\text{sol}}|$
 0       2.500         1.000           0.500                              1.000
 1       3.417        -0.417           0.417                              0.417
 2       2.883         0.117           0.117                              0.117
 3       2.985         0.015           0.015                              0.015
 4       3.001        -0.001           0.001                              0.001
 5       3.000         0.000           0.000                              0.000

FIGURE 11.20 Steps of the Broyden algorithm for starting point $y^{(0)} = [5/2\ \ 1]'$ and Jacobian matrix evaluated at the
starting point for $B^{(0)}$.

The MATLAB code for the Broyden algorithm and the call to it is given below.


Listing 11.19: C-BasicMethods/M/./Ch11/Broyden.m

function y1 = Broyden(f,y1,B,tol,itmax)
% Broyden.m -- version 2010-08-10
n = numel(y1);
if nargin < 3, B = eye(n,n); tol = 1e-2; itmax = 10; end
y1 = y1(:); y0 = -y1; k = 0;
F1 = feval(f,y1);
while ~converged(y0,y1,tol)
    y0 = y1;
    F0 = F1;
    s = B \ -F0;
    y1 = y0 + s;
    F1 = feval(f,y1);
    dF = F1 - F0;
    B = B + ((dF - B*s)*s')/(s'*s);
    k = k + 1;
    if k > itmax, error('Iteration limit reached'); end
end

y0 = [5/2 1];
ys = Broyden('ex2NLeqs',y0)
ys =
    0.0000
    3.0000
B = numJ('ex2NLeqs',y0);
ys = Broyden('ex2NLeqs',y0,B,1e-2,10)
ys =
    3.0000
    0.0000


11.6.5 Further approaches 
Damped Newton 


If the starting point is far from the solution, Newton's method and its variants generally do not
converge. This is due to the fact that the direction and, in particular, the length of the step are
highly unreliable. In order to construct at iteration $k$ a more conservative step, $y^{(k+1)}$ can be computed as

$$y^{(k+1)} = y^{(k)} + \alpha^{(k)} s^{(k)},$$

where $\alpha^{(k)}$ is a scalar to be determined and $s^{(k)}$ is a regular Newton step. Thus we set $0 < \alpha^{(k)} < 1$
if $y^{(k)}$ is far from the solution and $\alpha^{(k)} = 1$ if we get near the solution. One way to monitor
the parameter $\alpha^{(k)}$ is to link its value to the value of $\| F(y^{(k)}) \|$. This approach is called damped
Newton.

FIGURE 11.21 Objective function minimizing $\| F(y) \|_2$ for the solution of the system of nonlinear equations defined in
Example 11.4.

A more sophisticated method to maintain the Newton step in the appropriate direction is the 
so-called trust region method which essentially consists in estimating the radius of a region within 
which the Newton step is constrained to stay. For more details, see, for example, Heath (2005). 
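The damped Newton step described above can be sketched as follows. This is only an illustration under stated assumptions, not the book's implementation: the step length is halved until $\| F \|$ decreases (one simple way of linking $\alpha^{(k)}$ to $\| F(y^{(k)}) \|$), and the starting point, the tolerance, and the lower bound on `alpha` are chosen arbitrarily. The system is that of Example 11.4.

```matlab
% System of Example 11.4 with its analytic Jacobian
F = @(y) [y(1) + y(2) - 3; y(1)^2 + y(2)^2 - 9];
J = @(y) [1 1; 2*y(1) 2*y(2)];
y = [0.1; 0.2];                    % assumed starting point, far from both solutions
for k = 1:50
    Fy = F(y);
    if norm(Fy) < 1e-10, break, end
    s = J(y) \ (-Fy);              % regular Newton step
    alpha = 1;
    % shorten the step until ||F|| decreases (or alpha becomes tiny)
    while norm(F(y + alpha*s)) >= norm(Fy) && alpha > 1e-4
        alpha = alpha/2;
    end
    y = y + alpha*s;               % damped update
end
disp(y')
```

From this starting point the full Newton step would increase $\| F \|$, so the first update is taken with a shortened step; close to the solution the condition holds with `alpha = 1` and the iteration reverts to plain Newton steps.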


Solution by minimization 


To solve $F(y) = 0$, one can minimize the following objective function

$$g(y) = \| F(y) \|_p,$$

where $\| \cdot \|_p$ can be any norm in $\mathbb{R}^n$. One reason motivating this approach is that it provides a decision
criterion about whether $y^{(k+1)}$ constitutes a better approximation to the solution $y^*$ than $y^{(k)}$. As
the solution satisfies $F(y^*) = 0$, we may compare the norm of the vectors $F(y^{(k+1)})$ and $F(y^{(k)})$.
What we want is

$$\| F(y^{(k+1)}) \|_p < \| F(y^{(k)}) \|_p,$$

which can be achieved by minimizing the objective function

$$\min_y\; g(y) = \tfrac{1}{2}\, F(y)' F(y)$$

if we choose $p = 2$ for the norm. This minimization of a sum of squares can be solved with
the Gauss–Newton or Levenberg–Marquardt algorithm presented in Sections 11.5.2 and 11.5.3.
Fig. 11.21 shows the objective function defined as $\| F(y) \|_2$ of the system of nonlinear equations
defined in Example 11.4.
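As an illustration of the idea, the sum of squares can be handed to any minimizer. The sketch below deliberately uses fminsearch (Nelder–Mead) instead of the Gauss–Newton or Levenberg–Marquardt algorithms named in the text, simply because it needs no derivatives; the starting point and the tolerances are assumptions.

```matlab
% System of Example 11.4 and the objective g(y) = 1/2 F(y)'F(y)
F = @(y) [y(1) + y(2) - 3; y(1)^2 + y(2)^2 - 9];
g = @(y) 0.5 * (F(y)' * F(y));
opts = optimset('TolX', 1e-8, 'TolFun', 1e-8, ...
                'MaxIter', 2000, 'MaxFunEvals', 2000);
ymin = fminsearch(g, [1; 5], opts);   % derivative-free minimization of g
disp(ymin')
disp(norm(F(ymin)))                   % close to zero if a root was found
```

A value of `g` near zero at the returned point signals that a solution of $F(y) = 0$ has been found; a clearly positive value would indicate a local minimum of $g$ that is not a root.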


11.7 Synoptic view of solution methods 


The solution of linear systems of equations is a basic component in the methods presented in this
chapter. Indeed, when we use a classical optimization technique, solve Least Squares problems, linear
or nonlinear systems of equations, we have to find the solution of a linear system or a sequence
of linear systems. Several approaches have been presented, and it might be clarifying to give a
synthetic view and indicate the possible combinations among these methods. Fig. 11.22 attempts
to provide this overview. The leaf nodes with the block method indicate that for each block in turn,
we have a choice for the solution and, therefore, they link to the nodes [L] and [NL], respectively.
Similarly for Newton's method, we link to the node [L] for the solution of the linear system in the
successive steps.

FIGURE 11.22 Synoptic view of solution methods for systems of linear and nonlinear equations.
Chapter 12


Heuristic methods in a nutshell 
In 1957, Herbert Simon famously conjectured that within 10 years, computer chess programs would
be written that surpass the best human players (Coles, 1994). It took longer: only in 1997 did
IBM's Deep Blue score a tournament victory against then world champion Garry Kasparov (Campbell
et al., 2001). What is important here is not that Simon (who was well aware of human limitations
in forecasting) underestimated the required time, but the “philosophy” with which Deep Blue
approached the problem. The aim for Deep Blue was not to emulate human players, their strategies
and decision rules, in particular in pattern recognition. Deep Blue beat Kasparov because with its
specialized hardware it could evaluate more than 100 million board positions in a second, which
allowed it to do a deep search among possible moves (Campbell et al., 2001, Hsu, 2007); it beat
Kasparov by sheer force.

This book is not about chess (though choosing a move in a game is an optimization problem). 
But the story is instructive nevertheless. It mirrors the start of a paradigm shift that takes place 
in optimization, away from subtle mathematical theory towards simpler techniques, which in turn 
require fewer assumptions. The price to pay: much more computing power is needed. But “much 
more” is a relative term—your desktop PC suffices. 


12.1 Heuristics 


In this chapter, we will outline the basic principles of heuristic methods and summarize several 
well-known techniques. These descriptions are not meant as ultimate references (all the presented 
techniques exist in countless variations), but they are meant to demonstrate the basic rules by which 
these methods operate. In the following chapters we will discuss the implementation of several 
techniques. Indeed, if you are more practically inclined, you may as well jump directly to the 
tutorial in Chapter 13. 
What is a heuristic?


The term heuristic is used in various scientific fields for different, though often related, purposes. 
In mathematics, it is used for derivations that are not provable, sometimes even formally false, 
but lead to correct conclusions nonetheless. The term was made famous in this context by George 
Pólya (1957). Psychologists use the word heuristics for simple rules of thumb for decision making. 
The term acquired a negative connotation through the works of Daniel Kahneman and Amos Tver- 
sky in the 1970s, since their heuristics-and-biases program involved a number of experiments that 
showed the apparent suboptimality of such simple decision rules (Tversky and Kahneman, 1974). 
More recently, however, an alternative interpretation of these results has been advanced; see, for 
example, Gigerenzer (2004, 2008). Many studies indicate that while simple rules underperform in 
stylized experimental settings, they yield surprisingly good results in more realistic situations, in 
particular, in the presence of noise and uncertainty (e.g., estimation error). Curiously, substantial 
strands of literature in different disciplines document the good performance of simple methods 
when it comes to prediction, and judgment and decision making under uncertainty. Cases in point
are forecasting (Makridakis et al., 1979, Makridakis and Hibon, 2000, Goldstein and Gigerenzer,
2009; see specifically the so-called M-competitions), econometrics (Armstrong, 1978), psychology
and decision analysis (Dawes, 1979, 1994, Lovie and Lovie, 1986), and machine learning (Holte,
1993, Hand, 2006). Still, within its respective discipline, each of these strands represents a niche.
Such a strand developed in portfolio optimization in the 1970s; see, for instance, Elton and Gruber 
(1973) and Elton et al. (1978). These papers aimed at justifying the use of computationally feasible, 
but simplifying techniques. The problem then was to reduce computational cost, and the authors 
tried to give empirical justification for these simpler techniques. Today, complicated models are 
feasible, but they are still not necessarily better. 

The term heuristic is also used in computer science. Pearl (1984, p. 3) describes heuristics as 
methods or rules for decision making that (i) are simple, and (ii) give good results sufficiently often. 
Among computer scientists, heuristics are often related to research in artificial intelligence; some- 
times specific methods such as Genetic Algorithms are put on a level with, say, Neural Networks in 
the sense that both are computational architectures that solve problems. In this book, we will define 
heuristics in a narrower sense. We will always stay in the framework and language of optimization 
models, and our fundamental problem will be 


$$\underset{x}{\text{minimize}}\;\; f(x, \text{data})$$


where f is a scalar-valued function and x is a vector of decision variables. Recall that by switching 
the sign of f we can make it a maximization problem. In most cases, this optimization problem 
will be constrained. 

In fact, we have found it helpful not to think in terms of a mathematical description, but rather 
something like 


solutionQuality = function(x,data). 


That is, we only need to be able to write down (to program) a mapping from a solution to its quality, 
given the data. Heuristics, in the sense the term is used in this book, are a class of numerical methods 
that can solve such problems. Following similar definitions in Zanakis and Evans (1981), Barr et al. 
(1995), and Winker and Maringer (2007b), we characterize the term optimization heuristic through 
several criteria: 


• The method should give a “good” stochastic approximation of the true optimum; “goodness”
  can be measured in computing time and solution quality.

• The method should be robust to changes in the given problem's objective function and
  constraints, and also to changes in the problem size. Furthermore, results should not vary too much
  with changes in the parameter settings of the heuristic.

• The technique should be easy to implement.

• Implementation and application of the technique should not require subjective elements.
Such a definition is not unambiguous. Even in the optimization literature we find different
characterizations of the term heuristic. In operations research, heuristics are often not regarded as
stand-alone methods but as workarounds for problems in which “real” techniques like linear pro- 
gramming do not work satisfactorily; see, for instance, Hillier (1983). Or the term is used for ad 
hoc adjustments to “real” techniques that seem to work well but whose advantage cannot be proved 
mathematically. We will not follow this conception. Heuristics as defined here are general-purpose 
methods that can, as we will show, handle problems that are sometimes completely infeasible for 
classical approaches. Still, even though there exists considerable evidence of the good performance 
of heuristics, they are still not widely applied in research and practice. 

In a broad sense, we can differentiate between two classes of heuristics: constructive methods 
and iterative search methods. Constructive methods build new solutions in a stepwise procedure. 
An algorithm starts with an empty solution and adds components iteratively. Thus, the procedure 
terminates once we have found one complete solution. An example for this approach comes from 
the Traveling Salesman Problem. Solution methods exist where we start with one city and then add 
the remaining cities one at a time until a complete tour (i.e., one solution) is created. An example 
for a constructive method from finance is described in Example 13.1 on page 325. 

For iterative search methods, the algorithm moves from solution to solution, that is, a complete 
existing solution is changed to obtain a new solution. These new solutions need not always be 
close to the previous ones, as some methods (e.g., Genetic Algorithms) are discontinuous in their 
creation of new solutions. Hence, a new solution may be quite different from its predecessor; it 
will, however, usually share some characteristics with it. 


Iterative search 


In this book, we will only consider iterative search methods. For such a method, we start with one or 
several solutions, and then modify these solutions until a stopping criterion is satisfied. To describe 
such a method, we need to specify: 


(i) how to generate new solutions (i.e., how we modify existing solutions), 
(ii) when to accept such a modified solution, and 
(iii) when to stop the search. 


These three steps summarize the basic idea of an iterative search method. Clearly, such a description 
is very broad; it includes even many of the classical techniques discussed in the previous chapter. 
As an example, think of the steepest descent method of Section 11.4.1. Suppose we have an initial
(current) solution $x^c$, and we want to find a new solution $x^n$. Then the rules could be as follows:


(i) We estimate the slope (i.e., the gradient) of $f$ at $x^c$ which gives us the search direction. The
new solution $x^n$ is then $x^c - \gamma \nabla f(x^c)$, where $\gamma$ is a step size.
(ii) If $f(x^n) < f(x^c)$, then we accept $x^n$, that is, we replace $x^c$ by $x^n$.
(iii) We stop if no further improvements in $f$ can be found, or if we reach a maximum number of
function evaluations.
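The three rules can be written down in a few lines. The sketch below is an illustration only: the objective function, its gradient, the fixed step size `gamma`, and the evaluation limit `maxEval` are all assumptions, not taken from the text.

```matlab
% Hypothetical smooth objective and its gradient
f     = @(x) (x(1) - 1)^2 + 2*(x(2) + 0.5)^2;
gradf = @(x) [2*(x(1) - 1); 4*(x(2) + 0.5)];
gamma = 0.1; maxEval = 1000;           % step size and evaluation limit
xc = [3; 2]; nEval = 0;                % current solution
while nEval < maxEval                  % rule (iii): evaluation limit
    xn = xc - gamma * gradf(xc);       % rule (i): step along the negative gradient
    nEval = nEval + 1;
    if f(xn) < f(xc)                   % rule (ii): accept only improvements
        xc = xn;
    else
        break                          % rule (iii): no further improvement
    end
end
disp(xc')
```

Replacing rule (i) by a random perturbation and relaxing rule (ii) is, in essence, what the heuristics in the following sections do.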


Problems will mostly occur with rules (i) and (ii), as we explain next. Problems may occur with 
rule (iii), too. They are mostly caused by round-off error, but can, at least for the applications in 
this book, be avoided by careful programming; see Chambers (2008, Chapter 6), for a discussion. 

As was noted already in Chapter 10, there are models in which the gradient does not exist, or 
cannot be computed meaningfully (e.g., when the objective function is not smooth). Hence, we 
may need other approaches to compute a search direction. The acceptance criterion for classical 
methods is strict: if there is no improvement, a candidate solution is not accepted. But if the ob- 
jective function has several minima, this means we will never be able to move away from a local 
minimum, even if it is not the global optimum. 

Iterative search heuristics follow the same basic pattern (i)—(iii), but they have different rules 
that are better suited for problems with noisy objective functions, multiple minima and other prop- 
erties that may cause trouble for classical methods. In this sense, heuristics can be traced back to 
a class of numerical optimization methods that were introduced in the 1950s: direct search meth- 
ods. This is not to say that there is a direct historical development from direct search to heuristic 
methods. Contributions to the development of modern heuristics came from different scientific
disciplines. Yet, it is very instructive to study direct search methods. First, they are still widely applied,
in particular the Nelder–Mead algorithm. In MATLAB®, it is implemented in the function
fminsearch, and in R in the function optim (it is described in Section 11.4.4). Second, and more
important for us, they share several of the characteristics of heuristic methods: they are simple,
easy to implement, and simply work well for many problems.

In the previous chapter, we already looked into a direct search technique, Nelder—Mead search. 
Nelder—Mead search shares one feature with modern heuristics: it does not compute the search 
direction in a theoretically optimal way; rather, a good search direction is chosen. In fact, many 
heuristic techniques rely even less on good search directions. However, Nelder—Mead still has 
“classical” features. It has a strict acceptance criterion, that is, a new solution is only accepted if it 
is better than the current solution; and the method is deterministic, so from a given starting point 
it will always move to the same solution. If this solution is only a local minimum, Nelder—Mead 
will get stuck. Most heuristics thus add strategies to overcome local minima; one such strategy is 
randomness. 

In the following sections we will describe several well-known heuristics. To keep the presen- 
tation simple we only differentiate between single-solution methods and multiple-solution (a.k.a. 
population-based) methods, both terms to be explained below. More detailed classification systems 
of heuristics can be found in Talbi (2002) and Winker and Gilli (2004). 


12.2 Single-solution methods 


As their name suggests, single-solution methods evolve a single solution through (many) iterations. 
As this solution is gradually changed by the algorithm, the solution follows a path, a trajectory, 
through the search space, which is why these methods are sometimes called trajectory methods. 


12.2.1 Stochastic Local Search 


Assume that we have again a current solution $x^c$, and wish to obtain a search direction to compute
a new solution. But now instead of computing the slope of the objective function or of reflecting a
simplex as in Nelder–Mead, we use a much simpler mechanism. We randomly select one element
in $x^c$, i.e. typically one decision variable, and change it slightly. How do we change it? Again,
randomly; by adding a little noise. If this new solution is better than $x^c$, it replaces it; otherwise,
we keep $x^c$ as it was. This is the basic idea of a local search, or more precisely, a stochastic local
search.

The concept of local search is not new, but the technique was not regarded as a complete method 
in the literature. Rather, it was considered a component within other techniques, for example, as 
a safeguard against saddle points (Gill et al., 1986, p. 295). This reluctance is understandable: if 
gradient-based methods can be applied, local search will be grossly inefficient since it ignores the 
information that the derivatives of the objective function provide. This inefficiency will always 
remain—on a relative basis. But absolute computing time for given problems has declined so much 
in recent decades that local search could become a central building block of various methods. 

Unfortunately, the term local search is ambiguous, as it is used in the literature with very dif- 
ferent meanings. In this section, we will outline a specific algorithm for local search. Whenever we 
refer to this particular algorithm, we shall write Local Search, in title case. Later on, we will see 
other algorithms, specifically Simulated Annealing and Threshold Accepting, that build on local 
search. We refer to such algorithms, or more generally to the class of algorithms that build on local 
search, as local-search algorithms. This may appear confusing, but fear not; it will always be clear 
from the context what is meant. 

Local Search starts with a randomly chosen feasible solution. We call it the current solution
and label it $x^c$. Then Local Search picks, again randomly, a new solution $x^n$ close to $x^c$. This new
solution $x^n$ is called the neighbor solution. If it is better than $x^c$, the new solution is accepted
and replaces $x^c$; if not, it is rejected. This procedure is repeated many times over. Local Search
requires us to provide an objective function, a neighborhood function, and a stopping criterion. The
latter will simply be a preset number of steps. Algorithm 44 summarizes the procedure. In fact,
Local Search could be called a direct search method, even though it is rarely described as such
in the literature. And there are differences: unlike Nelder–Mead, Local Search does not exploit an
accepted search direction; and it is not deterministic.


Algorithm 44 Local Search.

1: set $n_{\text{steps}}$
2: randomly generate current solution $x^c$
3: for $i = 1 : n_{\text{steps}}$ do
4:     generate $x^n \in \mathcal{N}(x^c)$ and compute $\Delta = f(x^n) - f(x^c)$
5:     if $\Delta < 0$ then $x^c = x^n$
6: end for
7: return $x^c$
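A minimal sketch of Algorithm 44 for a continuous problem might look as follows. The objective function, the number of steps, and the neighborhood (perturb one randomly chosen element with Gaussian noise of size `stepSize`) are assumptions for illustration; no random seed is set, so results differ between runs.

```matlab
% Hypothetical non-smooth objective with many local minima
f = @(x) sum(x.^2) + 3*sum(abs(sin(5*x)));
nSteps = 5000; stepSize = 0.1;
xc = 4*rand(5,1) - 2;                  % random current solution
fc = f(xc); f0 = fc;
for i = 1:nSteps
    xn = xc;
    j = randi(numel(xc));              % pick one decision variable
    xn(j) = xn(j) + stepSize*randn;    % add a little noise
    fn = f(xn);
    if fn - fc < 0                     % accept only improvements
        xc = xn; fc = fn;
    end
end
fprintf('initial f: %8.4f   final f: %8.4f\n', f0, fc);
```

Note that only evaluations of `f` are needed; nothing in the loop assumes that the objective function is continuous or differentiable.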


In a sufficiently well-behaved setting, Local Search will, for a suitable neighborhood definition 
and enough iterations, succeed in finding the global minimum. The compensation for its lack of 
efficiency is that Local Search only requires that the objective function be evaluated for a given 
solution x; there is no need for the objective function to be continuous or differentiable or well- 
behaved in any other sense. Unfortunately, Local Search will, like direct search methods, stop at 
the first local optimum it encounters. However, repeatedly restarting the algorithm, even with the 
same initial solution, will normally produce different results. So there is at least a chance of not 
getting trapped and of finding better solutions. We discuss Local Search and show applications in 
Chapters 13 and 14. Hoos and Stützle (2004) provide a detailed discussion of the technique and its 
many variants. 

Heuristic methods that build on Local Search employ further strategies to avoid getting trapped 
in local minima. One common feature is the acceptance of solutions that do not lower (i.e. improve) 
the objective function, but actually increase it. The heuristics described in the following sections 
all share this feature. 


12.2.2 Simulated Annealing 


Probably the best-known single-solution method is Simulated Annealing (SA), introduced in
Kirkpatrick et al. (1983). SA was conceived for combinatorial problems, but can easily be used for
continuous problems as well. Algorithm 45 provides pseudocode of the procedure.


Algorithm 45 Simulated Annealing.

 1: set $T$  ## initial temperature level
 2: set $n_{\text{temp}}$  ## number of temperature levels
 3: set $n_{\text{steps}}$  ## steps per temperature level
 4: randomly generate current solution $x^c$
 5: set $x^* = x^c$
 6: for $r = 1 : n_{\text{temp}}$ do
 7:     for $i = 1 : n_{\text{steps}}$ do
 8:         generate $x^n \in \mathcal{N}(x^c)$ (neighbor to current solution)
 9:         compute $\Delta = f(x^n) - f(x^c)$ and generate $u$ (uniform random variable)
10:         if ($\Delta < 0$) or ($e^{-\Delta/T} > u$) then $x^c = x^n$
11:         if $f(x^c) < f(x^*)$ then $x^* = x^c$
12:     end for
13:     reduce $T$
14: end for
15: return $x^*$


Like Local Search in the previous section, SA starts with a random solution $x^c$ and creates a
new solution $x^n$ by adding a small perturbation to $x^c$. If the new solution is better than the current
one ($\Delta < 0$), it is accepted and replaces $x^c$. In case $x^n$ is worse, SA does not reject it right away, but
applies a stochastic acceptance criterion, thus there is still a chance that the new solution will be
accepted, albeit only with a certain probability. This probability is a decreasing function of both the
order of magnitude of the deterioration and the time the algorithm has already run. This time factor
is controlled by the temperature parameter $T$, which is reduced over time. Hence, impairments in
the objective function become less likely to be accepted and, eventually, SA resembles Local Search.
The algorithm stops after a predefined number of iterations.
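The procedure of Algorithm 45 can be sketched as follows. This is an illustration under assumptions, not a reference implementation: the objective function, the neighborhood, the initial temperature, and in particular the geometric cooling rule `T = 0.9*T` are choices made here, since the algorithm only says "reduce T". No random seed is set.

```matlab
% Hypothetical non-smooth objective with many local minima
f = @(x) sum(x.^2) + 3*sum(abs(sin(5*x)));
T = 1; nTemp = 50; nSteps = 200; stepSize = 0.1;
xc = 4*rand(5,1) - 2; fc = f(xc);      % random current solution
xbest = xc; fbest = fc; f0 = fc;
for r = 1:nTemp
    for i = 1:nSteps
        xn = xc;
        j = randi(numel(xc));
        xn(j) = xn(j) + stepSize*randn;    % neighbor of current solution
        fn = f(xn);
        Delta = fn - fc;
        u = rand;                          % uniform random variable
        if Delta < 0 || exp(-Delta/T) > u  % Metropolis acceptance
            xc = xn; fc = fn;
        end
        if fc < fbest, xbest = xc; fbest = fc; end
    end
    T = 0.9*T;                             % reduce T (assumed cooling rule)
end
fprintf('initial f: %8.4f   best f: %8.4f\n', f0, fbest);
```

Keeping track of `xbest` separately matters because the current solution may move uphill and end up worse than a solution visited earlier.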

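The acceptance criterion can be specified in different ways. In Algorithm 45, we have used
the so-called Metropolis function for which the probability of accepting an inferior solution is
$\min(1, e^{-\Delta/T})$. Graphically, the probability as a function of $\Delta$ is:

[Figure: Metropolis acceptance probability as a function of $\Delta$; vertical axis ticks at 0.5 and 1.]

There are alternatives: the Barker criterion $\frac{1}{1 + e^{\Delta/T}}$, for instance, completely blurs the distinction
between improvement and deterioration of a solution (Schuur, 1997):

[Figure: Barker acceptance probability as a function of $\Delta$; vertical axis tick at 0.5.]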
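The two acceptance functions are easy to compare numerically. The grid of $\Delta$ values and the temperature in this short sketch are arbitrary choices.

```matlab
T = 1;                                  % temperature (arbitrary)
Delta = linspace(-3, 3, 7);             % changes in the objective function
pMetropolis = min(1, exp(-Delta/T));    % always accepts improvements
pBarker     = 1 ./ (1 + exp(Delta/T));  % accepts improvements only with prob. < 1
disp([Delta; pMetropolis; pBarker]');
```

At $\Delta = 0$ the Metropolis function gives probability 1 while the Barker criterion gives 0.5, and under Barker even improvements are rejected with positive probability, which is the sense in which the distinction between improvement and deterioration is blurred.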


12.2.3 Threshold Accepting 


Threshold Accepting (TA) is similar to SA. Algorithm 46 shows that the two methods only differ 
in their acceptance criterion (Line 10 in Algorithm 45; Line 9 in Algorithm 46). Indeed, both SA 
and TA are sometimes called threshold methods. 


Algorithm 46 Threshold Accepting.

 1: set threshold sequence $\tau$
 2: set $n_{\text{thresholds}}$  ## length of $\tau$
 3: set $n_{\text{steps}}$  ## steps per threshold
 4: randomly generate current solution $x^c$
 5: set $x^* = x^c$
 6: for $r = 1 : n_{\text{thresholds}}$ do
 7:     for $i = 1 : n_{\text{steps}}$ do
 8:         generate $x^n \in \mathcal{N}(x^c)$ and compute $\Delta = f(x^n) - f(x^c)$
 9:         if $\Delta < \tau_r$ then $x^c = x^n$
10:         if $f(x^c) < f(x^*)$ then $x^* = x^c$
11:     end for
12: end for
13: return $x^*$
Whereas in SA solutions that lead to a higher objective function value are accepted stochastically,
TA accepts deteriorations unless they are greater than some threshold $\tau_r$. The $n_{\text{thresholds}}$ thresholds
decrease over time, hence like SA the algorithm turns into Local Search. Graphically, the probability
of accepting a solution as a function of $\Delta$ is:

[Figure: TA acceptance probability as a function of $\Delta$; vertical axis tick at 1.]
Threshold Accepting was introduced by Dueck and Scheuer (1990); Moscato and Fontanari (1990)
suggested the same deterministic updating rule in SA and called it “threshold updating.”

Why was TA suggested at all? The comparison with SA is mixed, as there seems to be no clear 
winner in optimization performance (Moscato and Fontanari, 1990). But at the time when TA was 
suggested, it also came with a sizeable reduction in computing time: Johnson et al. (1989) report
that evaluating $e^x$, as required for SA, was an expensive computation, taking up almost one-third
of the overall computation time.

For an in-depth description of TA, see Winker (2001). We will discuss the implementation of 
TA for portfolio optimization in Chapter 14. 
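To make the difference from SA concrete, here is a minimal sketch of Algorithm 46. The objective function, the neighborhood, and the linearly decreasing threshold sequence `tau` (ending at zero, so that the last round is a plain Local Search) are assumptions made for illustration; no random seed is set.

```matlab
% Hypothetical non-smooth objective with many local minima
f = @(x) sum(x.^2) + 3*sum(abs(sin(5*x)));
tau = linspace(0.5, 0, 10);            % assumed threshold sequence
nSteps = 500; stepSize = 0.1;
xc = 4*rand(5,1) - 2; fc = f(xc);      % random current solution
xbest = xc; fbest = fc; f0 = fc;
for r = 1:numel(tau)
    for i = 1:nSteps
        xn = xc;
        j = randi(numel(xc));
        xn(j) = xn(j) + stepSize*randn;   % neighbor of current solution
        fn = f(xn);
        Delta = fn - fc;
        if Delta < tau(r)                 % deterministic acceptance rule
            xc = xn; fc = fn;
        end
        if fc < fbest, xbest = xc; fbest = fc; end
    end
end
fprintf('initial f: %8.4f   best f: %8.4f\n', f0, fbest);
```

Compared with the SA sketch, the only change is the acceptance line: no random number is drawn and no exponential is evaluated.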


12.2.4 Tabu Search 


Most heuristics differ from classical methods by introducing an element of chance. In methods like 
SA or TA, for example, we picked neighbor solutions randomly. Tabu Search (TS), at least in its 
standard form, is an exception. It is deterministic for a given starting value. TS was designed for 
discrete search spaces; it is described in Glover (1986), Glover and Laguna (1997), and detailed 
in Algorithm 47. Its strategy to overcome local minima is to keep a memory of recently visited 
solutions. These are forbidden (tabu) as long as they stay in the algorithm’s memory. In this way, a 
TS can manage to walk away from a local minimum as it is temporarily not allowed to revisit that 
solution. 


Algorithm 47 Tabu Search.
1: initialize tabu list $T = \emptyset$
2: randomly generate current solution $x^c$
3: while stopping criteria not met do
4:   compute $V = \{x \mid x \in N(x^c)\} \setminus T$
5:   select $x^n = \operatorname{argmin}_{x \in V} f(x)$
6:   $x^c = x^n$ and $T = T \cup x^n$
7:   update memory, update best solution $x^*$
8: end while
9: return $x^*$
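
Algorithm 47 can be sketched in a few lines for solutions coded as tuples of bits. The neighborhood (flip exactly one bit), the fixed-length tabu list, and the toy objective are assumptions made for this illustration; the starting solution is passed in rather than generated randomly so that the run is reproducible.

```python
from collections import deque

def tabu_search(f, x0, n_iter=50, tabu_len=5):
    """Minimise f over bit tuples; neighbours differ in exactly one bit."""
    xc = tuple(x0)
    best, best_f = xc, f(xc)
    tabu = deque(maxlen=tabu_len)          # short-term memory, initially empty
    for _ in range(n_iter):
        neighbours = [xc[:j] + (1 - xc[j],) + xc[j + 1:] for j in range(len(xc))]
        candidates = [x for x in neighbours if x not in tabu]
        if not candidates:                 # every neighbour is tabu
            break
        xc = min(candidates, key=f)        # best non-tabu neighbour, even if worse
        tabu.append(xc)                    # oldest entry drops out when the list is full
        fc = f(xc)
        if fc < best_f:
            best, best_f = xc, fc
    return best, best_f

# Toy objective: the cost depends only on the number of ones; one 1 is a
# local minimum (cost 3), three 1s is the global minimum (cost 1).
costs = [5, 3, 4, 1, 2]
best, best_f = tabu_search(lambda x: costs[sum(x)], (0, 0, 0, 0))
print(best, best_f)
```

Because the best non-tabu neighbor is chosen even when it is worse than the current solution, the search leaves the local minimum with a single 1; a pure descent would stop there.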

12.3 Population-based methods


For single-solution methods, the key mechanism for escaping local minima is to temporarily allow 
uphill moves; the methods do not enforce an improvement of the objective function in every itera- 
tion. Population-based methods employ the same principle, but they do so by maintaining a whole 
collection of different solutions at a time, some of which are worse than others. Population-based 
methods are often better at exploration than single-solution methods, that is, they are often good at 
identifying favorable regions in the search space. 


12.3.1 Genetic Algorithms 


The best-known techniques in this category are Genetic Algorithms (GA). Genetic Algorithms were
described by John Holland in the 1970s (Holland, 1992); pseudocode can be found in Algorithm 48. 
GA are inspired by evolutionary biology, so the procedure appropriately starts with a population of 
solutions; the objective function becomes a fitness function (to be maximized); iterations become 
generations. In a standard GA, solutions are coded as binary strings like 01110001. 
Such a string may be a binary representation of an integer or real number, but in many discrete 
problems there is a more natural interpretation: in a selection problem, a 1 may indicate a selected 
item from an ordered list, a 0 may stand for an item that does not enter the solution. New candidate 
solutions, called children or offspring, are created by crossover (i.e. mixing existing solutions) and 
mutation (i.e. randomly changing components of solutions): 

[Figure: (A) Crossover: two parents and their children. (B) Mutation: the original solution 01110001 and its mutant.]


To keep the population size #{ P} constant, a selection among parents and children takes place at 
the end of each generation (the function “survive” in the code). Many variations exist; for example, 
only the #{ P} fittest solutions may stay in P, or the survival of a solution may be stochastic with the 
probability of survival proportional to a solution’s fitness. Depending on the selection mechanism, 
the currently best member of the population may become extinct. In such a case, we may want to 
keep track of the best solution over time. 


Algorithm 48 Genetic Algorithm.
 1: randomly generate initial population P of solutions
 2: evaluate P and store best solution $x^*$
 3: while stopping criteria not met do
 4:   select $P' \subset P$ (mating pool), initialize $P'' = \emptyset$ (set of children)
 5:   for i = 1 to n do
 6:     randomly select individuals $x^a$ and $x^b$ from $P'$
 7:     apply crossover to $x^a$ and $x^b$ to produce $x^{child}$
 8:     randomly mutate produced child $x^{child}$
 9:     $P'' = P'' \cup x^{child}$
10:   end for
11:   update best solution $x^*$
12:   P = survive($P'$, $P''$)
13: end while
14: return $x^*$
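
A minimal sketch of Algorithm 48 for binary strings follows. Several choices are assumptions made only for illustration: the mating pool is the fitter half of the population, crossover is one-point, each bit mutates with probability p_mut, and "survive" keeps the fittest solutions among mating pool and children. Fitness is maximized.

```python
import random

def genetic_algorithm(fitness, n_bits, pop_size=20, n_gen=40, p_mut=0.05, seed=1):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(n_gen):
        # mating pool P': the fitter half of the population
        pool = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        for _ in range(pop_size):
            xa, xb = rng.sample(pool, 2)
            cut = rng.randint(1, n_bits - 1)          # one-point crossover
            child = xa[:cut] + xb[cut:]
            # mutation: flip each bit with probability p_mut
            child = [1 - b if rng.random() < p_mut else b for b in child]
            children.append(child)
        best = max(children + [best], key=fitness)    # keep track of the best solution
        # survive: the pop_size fittest among mating pool and children
        pop = sorted(pool + children, key=fitness, reverse=True)[:pop_size]
    return best

# Toy fitness: the number of ones in the string.
best = genetic_algorithm(sum, 16)
print(best, sum(best))
```

Storing the best solution separately matters for selection mechanisms under which the best member may become extinct; with the deterministic survival rule used here it is only a safeguard.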


12.3.2 Differential Evolution 


A more recent contribution to population-based methods is Differential Evolution (DE; Storn and
Price, 1997). Algorithm 49 gives the pseudocode. DE evolves a population of $n_P$ solutions, stored in
real-valued vectors of length d. Thus, solutions are represented in DE in a way that is immediately
appropriate for continuous problems. The population P will be handled as a matrix of size $d \times n_P$;
each column holds one candidate solution, each row gives the values that one particular decision
variable takes on in the population. In every iteration (or generation) k, the algorithm goes through
the columns of this matrix and creates a new candidate solution for each existing solution $P^{(0)}_{\cdot,i}$. Such
a candidate solution is constructed by taking the difference between two other solutions, weighting
this difference by a scalar F, and adding it to a third solution. Then an element-wise crossover takes
place between this auxiliary solution $P^{(v)}_{\cdot,i}$ and the existing solution $P^{(0)}_{\cdot,i}$. In the pseudocode, rand
represents a random variable that is uniformly distributed between zero and one; the crossover
probability is CR. If this final candidate solution $P^{(u)}_{\cdot,i}$ is better than $P^{(0)}_{\cdot,i}$, it replaces it; if not, the
old solution $P^{(0)}_{\cdot,i}$ is kept. By construction, the best solution will always be kept in the population.
The algorithm stops after $n_G$ generations.

In its standard form as described here, DE randomly chooses solutions to be mixed and crossed.
This particular chance mechanism also means that in any given generation, potential changes to a
given solution come from a finite set of possible moves. Practically, this simply means we need a
sufficiently large population (Storn and Price suggest ten times the number of decision variables,
that is, 10d); we may also introduce additional randomness. Many variations of DE are described
in Price et al. (2005). We will use DE in Chapters 16 and 17.


Algorithm 49 Differential Evolution.
 1: set $n_P$, $n_G$, F and CR
 2: randomly generate initial population $P^{(1)}_{j,i}$, $j = 1, \ldots, d$, $i = 1, \ldots, n_P$
 3: for k = 1 to $n_G$ do
 4:   $P^{(0)} = P^{(1)}$
 5:   for i = 1 to $n_P$ do
 6:     randomly generate $\ell_1, \ell_2, \ell_3 \in \{1, \ldots, n_P\}$, $\ell_1 \neq \ell_2 \neq \ell_3 \neq i$
 7:     compute $P^{(v)}_{\cdot,i} = P^{(0)}_{\cdot,\ell_1} + F \times (P^{(0)}_{\cdot,\ell_2} - P^{(0)}_{\cdot,\ell_3})$
 8:     for j = 1 to d do
 9:       if rand < CR then $P^{(u)}_{j,i} = P^{(v)}_{j,i}$ else $P^{(u)}_{j,i} = P^{(0)}_{j,i}$
10:     end for
11:     if $f(P^{(u)}_{\cdot,i}) < f(P^{(0)}_{\cdot,i})$ then $P^{(1)}_{\cdot,i} = P^{(u)}_{\cdot,i}$ else $P^{(1)}_{\cdot,i} = P^{(0)}_{\cdot,i}$
12:   end for
13: end for
14: return best solution
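
The DE iteration can be sketched as follows. The sketch stores the population as a list of columns instead of a matrix; the initialization range [lo, hi], the default settings, and the toy objective (a sum of squares) are assumptions made for illustration.

```python
import random

def differential_evolution(f, d, n_p=20, n_g=100, F=0.5, CR=0.9,
                           lo=-5.0, hi=5.0, seed=1):
    rng = random.Random(seed)
    # each entry of pop is one column of the d x n_p population matrix
    pop = [[rng.uniform(lo, hi) for _ in range(d)] for _ in range(n_p)]
    fit = [f(x) for x in pop]
    for _ in range(n_g):
        old = [x[:] for x in pop]                       # P(0) = P(1)
        for i in range(n_p):
            # three distinct solutions, all different from i
            l1, l2, l3 = rng.sample([l for l in range(n_p) if l != i], 3)
            v = [old[l1][j] + F * (old[l2][j] - old[l3][j]) for j in range(d)]
            # element-wise crossover between auxiliary and existing solution
            u = [v[j] if rng.random() < CR else old[i][j] for j in range(d)]
            fu = f(u)
            if fu < fit[i]:                             # keep the better of the two
                pop[i], fit[i] = u, fu
    i_best = min(range(n_p), key=fit.__getitem__)
    return pop[i_best], fit[i_best]

def sphere(x):
    return sum(v * v for v in x)

x_best, f_best = differential_evolution(sphere, d=3)
print(x_best, f_best)
```

Since a solution is replaced only by a better one, the best objective function value in the population can never deteriorate from one generation to the next.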


12.3.3 Particle Swarm Optimization 


The metaphor for GA and DE was evolution, but for Particle Swarm Optimization (PSO) it is flocks
of birds that search for food (Eberhart and Kennedy, 1995). Like DE, PSO is directly applicable to
continuous problems; the population of $n_P$ solutions is again stored in real-valued vectors. In each
iteration, a solution is updated by adding another vector called velocity $v_i$; see Algorithm 50. We
can picture a solution as a position in the search space, and velocity as the direction of movement of
this solution. Velocity changes over the course of the optimization. At the start of each generation
the directions towards the best solution found by the particular solution, $Pbest_i$, and the best overall
solution, $Pbest_{gbest}$, are determined. These two directions (the differences between the respective
vectors; see Line 7) are each perturbed by multiplication with a uniform random variable $u_{(\cdot)}$ and a
constant $c_{(\cdot)}$, and then summed. The vector so obtained is added to the previous $v_i$; the resulting
updated velocity is added to the respective solution. The algorithm stops after $n_G$ generations.


Algorithm 50 Particle Swarm Optimization.
 1: set $n_P$, $n_G$ and $c_1$, $c_2$
 2: randomly generate initial population $P_i^{(0)}$ and velocity $v_i^{(0)}$, $i = 1, \ldots, n_P$
 3: evaluate objective function $F_i = f(P_i^{(0)})$, $i = 1, \ldots, n_P$
 4: Pbest = $P^{(0)}$, Fbest = F, Gbest = $\min_i(F_i)$, gbest = $\operatorname{argmin}_i(F_i)$
 5: for k = 1 to $n_G$ do
 6:   for i = 1 to $n_P$ do
 7:     $\Delta v_i = c_1 u_1 (Pbest_i - P_i^{(k-1)}) + c_2 u_2 (Pbest_{gbest} - P_i^{(k-1)})$
 8:     $v_i^{(k)} = v_i^{(k-1)} + \Delta v_i$
 9:     $P_i^{(k)} = P_i^{(k-1)} + v_i^{(k)}$
10:   end for
11:   evaluate objective function $F_i = f(P_i^{(k)})$, $i = 1, \ldots, n_P$
12:   for i = 1 to $n_P$ do
13:     if $F_i < Fbest_i$ then $Pbest_i = P_i^{(k)}$ and $Fbest_i = F_i$
14:     if $F_i <$ Gbest then Gbest = $F_i$ and gbest = i
15:   end for
16: end for
17: return best solution


PSO is used in Chapter 16 for robust regression. 
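
A minimal sketch of the PSO updates follows. It implements the velocity update exactly as described above, without an inertia weight or a limit on velocity; drawing the uniform variates separately for every element, the initialization ranges, and the toy objective are assumptions made for illustration.

```python
import random

def particle_swarm(f, d, n_p=20, n_g=100, c1=1.0, c2=1.0,
                   lo=-5.0, hi=5.0, seed=1):
    rng = random.Random(seed)
    P = [[rng.uniform(lo, hi) for _ in range(d)] for _ in range(n_p)]
    v = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n_p)]
    Fbest = [f(x) for x in P]                 # best value found by each solution
    Pbest = [x[:] for x in P]                 # and the position where it was found
    gbest = min(range(n_p), key=Fbest.__getitem__)
    for _ in range(n_g):
        for i in range(n_p):
            for j in range(d):
                dv = (c1 * rng.random() * (Pbest[i][j] - P[i][j])
                      + c2 * rng.random() * (Pbest[gbest][j] - P[i][j]))
                v[i][j] += dv                 # update velocity
                P[i][j] += v[i][j]            # move the solution
        for i in range(n_p):
            Fi = f(P[i])
            if Fi < Fbest[i]:
                Pbest[i], Fbest[i] = P[i][:], Fi
            if Fi < Fbest[gbest]:             # Fbest[gbest] plays the role of Gbest
                gbest = i
    return Pbest[gbest], Fbest[gbest]

def sphere(x):
    return sum(v * v for v in x)

x_best, f_best = particle_swarm(sphere, d=3)
print(x_best, f_best)
```

Unlike in DE, the current positions may become worse from one generation to the next; only the remembered positions Pbest are guaranteed not to deteriorate.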


12.3.4 Ant Colony Optimization


Ant Colony Optimization (ACO) was introduced in the early 1990s by Marco Dorigo in his PhD 
thesis; see Dorigo et al. (1999) for an overview. The algorithm is inspired by the search behavior
of ants (Goss et al., 1989). 

ACO is applicable to combinatorial optimization problems which are coded as graphs. A solu- 
tion is a particular path through the graph. ACO algorithms start with a population of ants, each 
of which represents an empty solution (thus, ACO has elements of a constructive method). The 
ants then traverse the graph, until each ant has constructed a solution. The ants’ way is guided by a 
pheromone trail: at each node, there is a probability to move along a particular edge; this probability 
is an increasing function of the pheromone (a positive number) that is associated with the specific 
edge. Initially edges are chosen with equal probability. After each iteration, the quality of the solu- 
tions is assessed; edges that belong to good solutions receive more pheromone. In later iterations, 
such edges are more likely to be chosen by the ants. Algorithm 51 summarizes the procedure. 


Algorithm 51 Ant Colony Optimization.
 1: set parameters
 2: while stopping criteria not met do
 3:   randomly place ants in graph
 4:   for each ant do
 5:     construct solution (tour)
 6:     evaluate solution
 7:   end for
 8:   save best solution $x^{sol}$
 9:   update trails
10: end while
11: return best solution
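
To make Algorithm 51 concrete, here is a minimal sketch for a small travelling-salesman problem. The particular rules are assumptions chosen for illustration and not prescribed by the text: edges are chosen with probability proportional to their pheromone, a fraction rho of all pheromone evaporates after each iteration, and every tour deposits 1/length on its edges.

```python
import random
from math import hypot

def ant_colony_tsp(dist, n_ants=10, n_iter=50, rho=0.5, seed=1):
    """Search for a short closed tour through all nodes; dist is symmetric."""
    rng = random.Random(seed)
    n = len(dist)
    tau = [[1.0] * n for _ in range(n)]      # equal pheromone at the start
    best_tour, best_len = None, float("inf")
    for _ in range(n_iter):
        tours = []
        for _ in range(n_ants):
            tour = [rng.randrange(n)]        # randomly place the ant
            while len(tour) < n:
                here = tour[-1]
                free = [j for j in range(n) if j not in tour]
                weights = [tau[here][j] for j in free]
                tour.append(rng.choices(free, weights=weights)[0])
            length = sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))
            tours.append((length, tour))
            if length < best_len:
                best_len, best_tour = length, tour
        # update trails: evaporation first, then deposits (more for short tours)
        for i in range(n):
            for j in range(n):
                tau[i][j] *= 1.0 - rho
        for length, tour in tours:
            for k in range(n):
                a, b = tour[k], tour[(k + 1) % n]
                tau[a][b] += 1.0 / length
                tau[b][a] += 1.0 / length
    return best_tour, best_len

# Four cities on the corners of the unit square.
pts = [(0, 0), (1, 0), (1, 1), (0, 1)]
D = [[hypot(p[0] - q[0], p[1] - q[1]) for q in pts] for p in pts]
print(ant_colony_tsp(D))
```

Each ant starts from an empty solution and adds one node at a time, which is the constructive element mentioned above.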


12.4 Hybrids 


The heuristics presented in the previous section can be considered as general procedures applicable 
to a wide range of problems, but they all have their particular features. Hybrid heuristics are made 
up by assembling components from different heuristics. The construction of hybrid heuristics can 
be motivated by the need to achieve a trade-off between the desirable features of specific heuris- 
tics. A more general notion of a hybrid heuristic would also allow for combining a heuristic with 
classical optimization tools such as direct search. Nevertheless our advice is to always start with a 
“standard” heuristic and switch to a hybrid method only in exceptional situations. 

As the components of different heuristics can be combined in numerous ways, it might prove
useful to give a structured view of the possibilities for constructing hybrids. We first classify
the heuristics according to their main features in a more fine-grained way than in the previous
section and then provide a simple scheme for possible combinations.¹


Trajectory (single-solution) methods 


The current solution is slightly modified by searching within the neighborhood of the current solu- 
tion. This is typically the case for threshold methods and Tabu Search. 


Discontinuous methods 


The full solution space is available for the new solution. The discontinuity is induced by genetic
operators (crossover, mutation), as is the case for Genetic Algorithms, Differential Evolution, and
Particle Swarm Optimization; it corresponds to jumps in the search space.


Single-agent methods 


One solution per iteration is processed. This is the case for threshold methods and Tabu Search. 


1. This presentation builds on Talbi, 2002, Taillard et al., 2000, and Birattari et al., 2001. 

Multi-agent or population-based methods


A population of searching agents contributes to the collective experience. This is the case for Ge- 
netic Algorithms, Ant Colonies, Differential Evolution, and Particle Swarm Optimization. 


Guided search or search with memory usage 


Incorporates some additional rules and hints on where to search. In Genetic Algorithms, Differ- 
ential Evolution, and Particle Swarm Optimization, the population represents the memory of the 
recent search experience. In Ant Colony Optimization, the pheromone matrix represents an adap- 
tive memory of previously visited solutions. In Tabu Search, the tabu list provides a short-term 
memory. 


Unguided search or memoryless methods 


Relies entirely on the search heuristic. This is the case for threshold methods.


Talbi (2002) suggests a classification combining a hierarchical scheme and a flat scheme. The 
hierarchical scheme distinguishes between low-level and high-level hybridization and within each 
level we distinguish relay and co-evolutionary hybridization. Low-level hybridization replaces 
a component of a given heuristic by a component from another heuristic. In the case of high- 
level hybridization, different heuristics are self-contained. Relay hybridization combines different 
heuristics in a sequence whereas in co-evolutionary hybridization the different heuristics cooperate. 

For the flat scheme we distinguish the following hybridizations: (i) homogeneous hybrids where
the same heuristics are used and heterogeneous hybrids where different heuristics are combined;
(ii) global hybrids where all algorithms explore the same solution space and partial hybrids which 
work in a partitioned solution space; (iii) specialist hybrids combine heuristics solving different 
problems whereas in general hybrids the algorithms all solve the same problem. Fig. 12.1 illustrates 
this hierarchical and flat classification. 


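[Figure: hierarchical scheme (low-level or high-level; relay or co-evolutionary) and flat scheme (homogeneous or heterogeneous; global or partial; specialist or general).]

FIGURE 12.1 Scheme for possible hybridizations.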


A few examples might demonstrate the construction of hybrids following the two-level hierar- 
chical and flat scheme. 


Low-level relay hybrid 


As an example we could consider a Simulated Annealing where a neighbor $x^n$ is obtained as follows:
select a point $x^i$ in the larger neighborhood of $x^c$ and perform a descent local search. If this
point is not accepted (upper panel) we return to $x^c$ (not $x^i$) and continue (lower panel):

[Figure: upper panel, the candidate point is not accepted; lower panel, the search returns to $x^c$ and continues.]


Low-level co-evolutionary hybrid


Population-based algorithms perform well in the exploration of the search space but are weak in the 
exploitation of the solutions found. Therefore, for instance, a possible hybridization would be to 
use in a Genetic Algorithm a greedy heuristic² for the crossover and a Tabu Search for the mutation
as indicated in the following pseudocode. 


select $P' \subset P$ (mating pool), initialize $P'' = \emptyset$ (children)
for i = 1 to n do
  select individuals $x^a$ and $x^b$ at random from $P'$
  apply crossover to $x^a$ and $x^b$ to produce $x^{child}$ (greedy algorithm)
  randomly mutate produced child $x^{child}$ (Tabu Search (TS))
  $P'' = P'' \cup x^{child}$
end for


High-level relay hybrid


Examples are the use of a greedy heuristic to generate the initial population of a Genetic Algorithm
and/or threshold method and Tabu Search to improve the population obtained by the Genetic
Algorithm as described below.


1: generate current population P of solutions (greedy algorithm) 
2: compute GA solution 
3: improve solution with threshold method (TM) 


Another example is the use of a heuristic to optimize another heuristic, that is, find the optimal 
values for the parameters. 


High-level co-evolutionary hybrid 


In this scheme, many self-contained algorithms cooperate in a parallel search to find an optimum. 


12.5 Constraints 


Nothing in the algorithms that we presented above ensures that a constraint on a solution x is 
observed. But it is often constraints that make models realistic—and difficult. Several strategies 
exist for including restrictions. Note that in what follows, we do not differentiate between linear, 
nonlinear, integer constraints, and so on. It rarely matters for heuristics to what class a constraint 
belongs; computationally, any functional form is possible. 

All techniques discussed before are iterative methods, that is, we move from one solution to 
the next. The simplest approach is to “throw away” infeasible new solutions. Suppose we have a 
current solution, and we modify it to get a neighbor solution. If this neighbor violates a constraint, 
we just pick a new neighbor. Clearly, this works only for stochastic choices of neighbor solutions, 
but almost all heuristics are stochastic. This strategy may appear inefficient, but if our model has 
only a few constraints that are rarely hit, it is often a good strategy.

A second approach is to directly use the information of the constraint to create new solutions. 
An example from portfolio optimization (discussed in more detail in Chapter 14) is the budget 
constraint, that is, we require that all asset weights sum to one. This constraint can be enforced 
when we compute new solutions by increasing some weights and decreasing others such that the 
sum of all weight changes is zero. 
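
A minimal sketch of such a neighborhood function: a fixed amount of weight is moved from one randomly chosen asset to another, so the weight changes sum to zero. The function name and the step size are assumptions made for illustration; the sketch does not check other restrictions, so weights may become negative.

```python
import random

def neighbour_budget(w, step=0.01, rng=random):
    """Move `step` of weight from one asset to another; sum(w) is unchanged."""
    w = list(w)                              # leave the current solution intact
    i, j = rng.sample(range(len(w)), 2)      # two different assets
    w[i] += step
    w[j] -= step
    return w

w = [0.25, 0.25, 0.25, 0.25]
print(neighbour_budget(w, step=0.05, rng=random.Random(0)))
```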


2. The term greedy optimization is mostly used for constructive methods. A greedy method constructs a solution by, in each 
iteration, taking the best possible move according to an optimality criterion. Greedy techniques do not look ahead (they are 
myopic) and, hence, cannot escape local minima. 

A third, older but still used idea is to transform variables. This approach sometimes works for
constraints that require that the elements of x lie in certain ranges; see the discussion in Powell (1972).
For instance, sin(x) will map any real x to the range [−1, 1]; $\alpha (\sin(x))^2$ will give a mapping to
$[0, \alpha]$. But such transformations come with their own problems; see Gill et al. (1986, Section 7.4);
in particular it may become difficult to change a problem later on, or to handle multiple constraints. 

Fourth, we can repair a solution that does not observe a restriction. We can introduce mechanisms
to correct such violations. To stick with our example, if a solution x holds the portfolio
weights, then scaling every element $x_j$ of x as $x_j / \sum_i x_i$ will ensure that the weights sum to unity.
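
As a small sketch, the repair step for the budget constraint is a one-line rescaling. The function name is an assumption, and so is the decision to raise an error when the elements sum to zero, a case in which the scaling is undefined.

```python
def repair_budget(x):
    """Rescale the elements of x so that they sum to one."""
    total = sum(x)
    if total == 0:
        raise ValueError("cannot rescale: elements sum to zero")
    return [xj / total for xj in x]

print(repair_budget([2.0, 1.0, 1.0]))
```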

Finally, we can penalize infeasible solutions. Whenever a constraint is violated, we add a penalty 
term to the objective function and so downgrade the quality of the solution. In essence, this changes 
the problem to an unconstrained one for which we can use the heuristic. The penalty is often 
made an increasing function of the magnitude of violation. Thus, the algorithm may move through 
infeasible areas of the search space, but will have guidance to return to feasible areas. The penalty 
approach is the most generic strategy to include constraints; it is convenient since the computational 
architecture rarely needs to be changed. Penalties create soft constraints since the algorithm could
in principle always override a penalty; practically, we can set the penalty so high that we have hard 
constraints. The preferred strategy is to start with a low penalty; if the final solution still violates 
any constraints, the penalty is increased gradually until we get a feasible solution. 
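
The penalty approach amounts to wrapping the objective function, as in the following minimal sketch. It assumes that each constraint is supplied as a function that returns the magnitude of its violation (zero if the constraint holds); the names and the linear form of the penalty are illustrative assumptions.

```python
def penalised(f, violations, penalty=1.0):
    """Return a new objective: f(x) plus penalty times the total violation."""
    def f_pen(x):
        return f(x) + penalty * sum(g(x) for g in violations)
    return f_pen

def sum_of_squares(x):
    return sum(v * v for v in x)

def budget_violation(x):
    return abs(sum(x) - 1.0)        # zero if the weights sum to one

f_low = penalised(sum_of_squares, [budget_violation], penalty=1.0)
f_high = penalised(sum_of_squares, [budget_violation], penalty=10.0)
print(f_low([1.0, 1.0]), f_high([1.0, 1.0]), f_high([0.5, 0.5]))
```

The heuristic itself is unchanged; it is simply run on the wrapped function, and the penalty weight can be raised between runs if the returned solution is still infeasible.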

So which approach is best? Unfortunately, it depends on the problem. Often, we will use a 
mixture of these approaches. In fact, we can never know whether we have found the most efficient 
implementation. But fortunately, we do not need to. Herbert Simon’s “satisficing” gives the rule: if 
we can solve our model to sufficient precision under the given practical constraints (e.g., available 
time), we can stop searching. So which approach is good? In our experience, a good way to start 
is with penalties. Penalties are quickly implemented, and they offer other advantages: for instance, 
they work well if the search space is not convex or even disconnected, as illustrated below. 


[Figure: illustration of a search space that is not convex or is disconnected.]


When we need a really fast implementation, repairing or using the constraints directly is likely 
more efficient. But it is also less flexible—adding further constraints or changing the problem may 
require changes in the repair mechanism—and simply requires more time for development and 
testing. Thus, again, for testing purposes and as long as a model is not put into very final form, 
penalties are a good idea. 


12.6 The stochastics of heuristic search 
12.6.1 Stochastic solutions and computational resources 


Suppose we wanted to solve an optimization problem with a naïve random sampling approach. We 
would (i) randomly generate a large number of candidate solutions, (ii) evaluate these solutions, 
and (iii) pick the best one. If we repeated the whole procedure a second time, our final solution 
would probably be a different one. Thus, the solution x we obtain from step (iii) is stochastic. 
The difference between our solution and the actual optimum would be a kind of truncation error 
since if we sampled more and more, we should in theory come arbitrarily close to the optimum. 
Importantly, the variability of the solution stems from our numerical technique; it has nothing to 
do with the error terms that we often have in our models to account for uncertainty. Stochastic 
solutions may even occur with non-stochastic methods: think of search spaces like those shown in 
Chapter 10. Even if we used a deterministic method like a gradient search, the many local minima 
would make sure that repeated runs from different starting points would result in different solutions. 

Almost all heuristics are stochastic algorithms. So running the same technique twice, even with 
the same starting values, will usually result in different solutions. To make the discussion more 
tangible, we use again the example of portfolio selection (we discuss such models in Chapter 14).
A solution x is a vector of portfolio weights for which we can compute the objective function
value f(x). The objective function could be the variance of the portfolio or another function of the
portfolio return. So, if we have two candidate solutions $x^{(1)}$ and $x^{(2)}$, we can easily determine which
one is better.

Since our algorithms are stochastic, we can treat the result of our optimization procedures as 
a random variable with some distribution D. What exactly the “result” of a restart is depends on 
our setting. In most cases that we discuss later, it is only the objective function value (i.e., the 
solution quality) that we obtain from a single run of our technique. Alternatively, we may also look 
at decision variables given by a solution, that is, the portfolio weights. In any case, we collect all 
the quantities of interest in a vector $\varphi$. The result $\varphi_j$ of a restart j is a random draw from D.

The trouble is that we do not know what D looks like. But fortunately, there is a simple way to 
find out for a given problem. We run a reasonably large number of restarts, each time store $\varphi_j$, and
finally compute the empirical distribution function of the $\varphi_j$, j = 1, ..., number of restarts, as an
estimate for D. For a given problem or problem class, the shape of the distribution D will depend 
on the chosen method. Some techniques will be more appropriate than others and give less variable 
and on average better results. And it will often depend on the particular settings of the method, in 
particular the number of iterations—the search time—that we allow for. 
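
The procedure can be sketched in a few lines. The function `toy_run` is a hypothetical stand-in for a single restart of a heuristic (here: the best of 50 random points for a simple function); in an application it would be replaced by one run of the chosen method that returns the objective function value.

```python
import random
from bisect import bisect_right

def empirical_distribution(run, n_restarts, seed=1):
    """Call run(rng) n_restarts times and return the sorted results."""
    rng = random.Random(seed)
    return sorted(run(rng) for _ in range(n_restarts))

def ecdf(sample, q):
    """Estimate of D at q: share of restarts with a result not larger than q."""
    return bisect_right(sample, q) / len(sample)

def toy_run(rng):
    # hypothetical restart: best of 50 random points for f(x) = x^2 on [-1, 1]
    return min(rng.uniform(-1.0, 1.0) ** 2 for _ in range(50))

phi = empirical_distribution(toy_run, 500)
median = phi[len(phi) // 2]
print(median, ecdf(phi, 0.001))
```

Repeating the experiment with a different number of iterations per restart (here, the 50 draws) gives a second sample, and comparing the two empirical distributions shows the effect of the additional search time.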

Unlike classical optimization techniques, heuristics can walk away from local minima; they will 
not necessarily get trapped. Intuitively, then, if we let the algorithm search for longer, we can hope 
to find better solutions. Thus the shape of D is strongly influenced by the amount of computational 
resources spent. One way to measure computational resources is the number of objective function 
evaluations. For minimization problems, when we increase computational resources, the mass of D 
will move to the left, and the distribution will become less variable. Ideally, when we let the com- 
puting time grow ever longer, D should degenerate into a single point, the global minimum. There 
exist proofs of this convergence to the global minimum for many heuristic methods (e.g., Gelfand 
and Mitter, 1985, for Simulated Annealing; Rudolph, 1994, for Genetic Algorithms; Gutjahr, 2000, 
Stützle and Dorigo, 2002, for Ant Colony Optimization; or van den Bergh and Engelbrecht, 2006,
for Particle Swarm Optimization). But unfortunately these proofs are not much help for practical 
applications. First, they often rely on asymptotic arguments, but with an infinity of iterations even 
random sampling will eventually produce the global optimum. We can always make sure that the 
global optimum is achievable: we just need to randomly select starting values for our algorithm
such that any feasible solution could be chosen. Then, with an infinity of restarts, we are bound to 
find the global optimum. But random sampling achieves the same theoretical guarantees, so search 
iterations that a heuristic adds would then not increase our confidence. 

Second, many such proofs are nonconstructive (e.g., Althöfer and Koschnick, 1991, for TA).
They demonstrate that there exist parameter settings for given methods that lead (asymptotically) 
to the global optimum. Yet, practically, there is no way of telling whether the chosen parameter 
setting is correct in this sense; we are never guaranteed that D really degenerates to the global 
optimum as the number of iterations grows. 

Fortunately, we do not need these proofs to make meaningful statements about the performance 
of specific methods. For a given problem class, we can run experiments. We choose a parameter 
setting for a method, and we repeatedly solve the given problem. Each run j gives a solution with 
parameters $\varphi_j$; from such a sample we can compute an estimate for D. So when we speak of
“convergence,” we always mean “change in the shape of D.” Such experiments also allow investigation
of the sensitivity of the solutions with respect to different parameter settings for the heuristic. Ex- 
perimental results are of course no proof of the general appropriateness of a method; but they are 
evidence of how a method performs for a given class of problems, which often is all that is needed 
for practical applications. 




12.6.2 An illustrative experiment 


A simple asset allocation model 


The discussion about stochastic solutions is best illustrated by an example. Before we do so how- 
ever, we would like to make clear what this example shows, and what it does not show. There is a 
trade-off between, in general terms, search time (or computing time) and quality of a solution. If we
search for longer, we find (on average) better solutions. Quality of the solution here refers solely
to the numerical solution of our optimization model. We need to be careful, for there are actually two
relationships that we should be interested in: if we search for longer, we should at least on average 
find better solutions to our model, given our data. That is, for a typical financial application, we 
find better solutions in-sample. So, we have one trade-off between computing time and quality of 
the in-sample solution. Then, we would hope, better solutions to the model should result in better 
solutions to our actual problem (like selecting a portfolio that yields a high risk-adjusted return 
over the next month). The first relationship (in-sample) seems reasonable, though to be practically 
useful we need to be more concrete; we should run experiments to estimate the speed of convergence.
This is what we do in this example. But the second trade-off is not at all clear. There is no reason to 
assume that there exists a positive relationship between in-sample and out-of-sample quality. This 
is something that needs to be tested empirically (Gilli and Schumann, 2011b). This could imply
that in some cases we might not even want the optimal solution to our model. From an optimiza- 
tion viewpoint this is awkward: our model is misspecified. Thus, we should write down the model 
such that we are interested in its optimum. Yes, agreed; but how can we know? Only by empirical 
testing. This point is well understood in applications in which the model itself is chosen by some 
automatic procedure, for instance, when people train neural networks, or in model selection. It is 
rarely discussed when it comes to parameter optimization for a chosen model. 

Now let us look at a concrete example. We describe the problem informally here; code is given in 
Chapter 14. Suppose we are given a universe of 500 assets (for example, mutual funds), completely 
described by a given variance—covariance matrix, and we are asked to find an equal-weight portfolio 
with minimal variance under the constraints that we have only between Kmin and Kmax assets in
the portfolio. Thus, we have a pure selection problem, a discrete problem. How could we compute 
a solution to this problem? Here are three possible approaches. 


1. Write down all portfolios that fulfill the constraints (i.e., have between Kmin and Kmax compo- 
nents), compute the variance of each, and pick the one with the lowest variance. 

2. Choose k portfolios randomly and keep the one with the lowest variance. 

3. Compute the variance of each asset and sort the assets by their variance. Then construct a port- 
folio with the Kmin assets with the lowest variance, then with Kmin + 1 assets, and so on until 
a portfolio of the Kmax assets with the lowest variance. Of those Kmax — Kmin + 1 portfolios, 
pick the one with the lowest variance. 


Approach (1) is infeasible. Suppose we were to check cardinalities between 30 and 100. For 
100 out of 500 alone we have $10^{107}$ possibilities, and that leaves us 30 out of 500, 31 out of 500,
and so on. Even if we could evaluate millions of portfolios in a second it would not help. 
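
The order of magnitude is easily checked. The following lines count the portfolios with binomial coefficients; the evaluation speed of one million portfolios per second is only an assumed figure for the back-of-the-envelope calculation.

```python
from math import comb, log10

n, k_min, k_max = 500, 30, 100
n_100 = comb(n, 100)                                   # 100 out of 500
n_all = sum(comb(n, k) for k in range(k_min, k_max + 1))
print(f"100 out of 500: about 10^{log10(n_100):.1f} portfolios")
print(f"30 to 100 out of 500: about 10^{log10(n_all):.1f} portfolios")

years = n_all / 1e6 / (365 * 24 * 3600)                # assumed: 10^6 evaluations per second
print(f"time needed: about 10^{log10(years):.0f} years")
```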

Approach (2) has the advantage of being simple, and we can scale computational resources (in- 
crease k). That is, we can use the trade-off between available computing time and solution quality. 
Approach (2) can be thought of as a sample substitute for Approach (1). 

In Approach (3), we ignore the covariation of assets (i.e., we only look at the main diagonal of 
the variance—covariance matrix), but we only have to check Kmax — Kmin + 1 portfolios. There may 
be cases, however, in which we would wish to include correlation. 

To see how Approaches (2) and (3) perform, we set up an experiment. We create an artificial data 
set of 500 assets, each with a volatility of between 20% and 40%; each pairwise correlation is 0.6. 
In our case, $\varphi$ is the volatility (the square root of the variance) of the portfolio that our algorithm
returns. The following figure gives estimates of D, each obtained from 500 restarts. 

We see that completely random portfolios produce a distribution with a median of about 23½%.
What would happen if we drew more portfolios? The shape of D would not change, since we are 
merely increasing our sample size. The estimates of the tails would become more precise. (Why, a 
sample might even include the global minimum!) We also plot the distribution of a “best-of-1000” 
and a “best-of-100,000” strategy. With this, we get the median volatility below 21%. We also add 
the results for the sort algorithm (Approach 3). 
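
Approaches (2) and (3) can be sketched as follows; this is not the code of Chapter 14. Two simplifications are assumptions of the sketch: a random portfolio is drawn by first choosing its cardinality uniformly between Kmin and Kmax, and the portfolio variance is computed with a shortcut that is valid only because all pairwise correlations are equal, so the full variance-covariance matrix is never built.

```python
import random
from math import sqrt

RHO = 0.6   # common pairwise correlation, as in the experiment

def volatility(assets, sigma):
    """Volatility of an equal-weight portfolio when all correlations equal RHO."""
    s = [sigma[i] for i in assets]
    k = len(s)
    var = ((1 - RHO) * sum(v * v for v in s) + RHO * sum(s) ** 2) / k**2
    return sqrt(var)

def random_portfolio(n, k_min, k_max, rng):
    k = rng.randint(k_min, k_max)            # cardinality first, then the assets
    return rng.sample(range(n), k)

def best_of(k_draws, sigma, k_min, k_max, rng):
    """Approach (2): keep the best of k_draws random portfolios."""
    return min(volatility(random_portfolio(len(sigma), k_min, k_max, rng), sigma)
               for _ in range(k_draws))

def sort_rule(sigma, k_min, k_max):
    """Approach (3): portfolios of the K lowest-variance assets, K = k_min..k_max."""
    order = sorted(range(len(sigma)), key=sigma.__getitem__)
    return min(volatility(order[:k], sigma) for k in range(k_min, k_max + 1))

rng = random.Random(42)
sigma = [rng.uniform(0.20, 0.40) for _ in range(500)]   # artificial volatilities
print(best_of(1, sigma, 30, 100, rng), best_of(1000, sigma, 30, 100, rng))
print(sort_rule(sigma, 30, 100))
```

Calling `best_of` repeatedly with the same number of draws gives one sample from D per call; the sort rule is deterministic and always returns the same value for given data.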


Local Search 


Now let us use a simple heuristic, Local Search. We start with a random portfolio and compute 
its volatility. Then we randomly pick one asset of our universe. If it is already in the portfolio, we 
remove it. If it is not in the portfolio, we put it in. We compute the volatility of this new portfolio. If 
it is lower than the old portfolio’s volatility, we keep the new portfolio; if not, we stick with the old 
portfolio. We include constraints in the simplest way: if a new portfolio violates a constraint, we 
never accept it. We run this search for 100 steps, 1000 steps, and 10,000 steps. The picture below 
shows the results. Again the distributions are computed from 500 restarts. 


[Figure: empirical distributions of portfolio volatility for Local Search with 100, 1000, and 10,000 steps.]

We see that the local search easily finds better portfolios than random search, and in less time.
Compare the distribution of the “best-of-100,000” strategy with the distribution of a local search 
with 10,000 steps. The latter one needs only one-tenth of the number of objective function evalua- 
tions. 

For the local search, we picked one asset and either put it into the portfolio or removed it. How 
would the results change if we chose two assets, or five? The following figure shows results with 
10,000 steps (note that the x-scale has changed). It appears that changing only one asset at a time 
works better than changing several assets. 
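
The Local Search used in this experiment can be sketched as follows; again, this is not the code of Chapter 14. The parameter n_change (an assumed name) sets how many assets are switched in or out per step, and the variance is computed with a shortcut that relies on all pairwise correlations being equal.

```python
import random
from math import sqrt

def volatility(assets, sigma, rho=0.6):
    """Volatility of an equal-weight portfolio when all correlations equal rho."""
    s = [sigma[i] for i in assets]
    k = len(s)
    return sqrt(((1 - rho) * sum(v * v for v in s) + rho * sum(s) ** 2) / k**2)

def local_search(sigma, k_min, k_max, n_steps, rng, n_change=1):
    n = len(sigma)
    current = set(rng.sample(range(n), rng.randint(k_min, k_max)))
    f_current = volatility(current, sigma)
    for _ in range(n_steps):
        picked = set(rng.sample(range(n), n_change))
        new = current ^ picked               # assets in the portfolio leave, others enter
        if not k_min <= len(new) <= k_max:
            continue                         # infeasible portfolios are never accepted
        f_new = volatility(new, sigma)
        if f_new < f_current:                # accept only improvements
            current, f_current = new, f_new
    return current, f_current

data_rng = random.Random(42)
sigma = [data_rng.uniform(0.20, 0.40) for _ in range(500)]
for steps in (100, 1000):
    _, vol = local_search(sigma, 30, 100, steps, random.Random(1))
    print(steps, vol)
```

Each restart with a different seed gives one draw from D; collecting the returned volatilities over many restarts yields distributions like those discussed above.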

In sum, a simple strategy like local search yielded very good results for a problem whose exact
solution could never be computed. The quality of the solution was evidently determined by the 
chosen method, by the computational resources (how many steps?), and by the particular settings 
(how many assets to change?). As a rule, computational resources have much more influence on 
the results than parameter settings. 

It will not always be so simple. In later chapters, we will discuss how we can solve more-difficult 
problems, in particular when we have constraints. 


12.7 General considerations 
12.7.1 What technique to choose? 


It is difficult to give general advice on this question. Different methods may be able to solve a 
given problem class, so one possibility would be to test which technique is best. We do not suggest 
pursuing such a strategy; “best” is certainly not required. What we need is a technique that 
provides sufficiently good solutions. This is, again, the satisficing rule: we define a desired quality 
for a solution, and once we have found a solution that meets this requirement, we stop searching. 
Admittedly, this is not too helpful, so here are a few more concrete points. 

First, one should start with a method that allows a natural representation of the problem. For 
instance, for a combinatorial problem, a natural technique may be a Genetic Algorithm, or perhaps 
Simulated Annealing; for a continuous problem, Differential Evolution will probably be a promising 
first candidate. A second suggestion is to start simple and only add features if really required; 
hybridization in particular is not recommended, at least not at an early stage. Not because hybrids 
are not useful for certain problems, but because exploring a problem with a simpler method often 
allows a better understanding of where the difficulties lie. 

A final remark, which is valid for different methods: since heuristics are stochastic, repeated 
runs will differ in their results. Also, varying the parameters of a method will often change its 
performance, but because of the random nature of the obtained solutions it is difficult to distinguish 
meaningful differences from noise. It is easy to spend hours and days changing something here or 
there, but without getting anywhere. In our experience it is better to first spend some time on an 
efficient implementation of a model, and then to run (small) experiments, with at least 10 or 20 runs 
for each parameter setting. 
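
Such a small experiment might look as follows. This is only a Python sketch under stated assumptions: the toy objective (x squared on [-1, 1]), the use of plain random search as the stochastic method, and the names `random_search` and `experiment` are all made up for illustration. The point is to compare distributions of results (here, quartiles over 20 runs) rather than single runs.

```python
import random
import statistics

def random_search(f, n_evals, rng):
    """Toy stochastic method: best of n_evals uniform draws from [-1, 1]."""
    return min(f(rng.uniform(-1.0, 1.0)) for _ in range(n_evals))

def experiment(f, settings, n_runs=20, seed=0):
    """Run the method n_runs times per setting and summarize by quartiles."""
    rng = random.Random(seed)
    summary = {}
    for n_evals in settings:
        results = [random_search(f, n_evals, rng) for _ in range(n_runs)]
        q1, med, q3 = statistics.quantiles(results, n=4)
        summary[n_evals] = (q1, med, q3)
    return summary

summary = experiment(lambda x: x * x, settings=(10, 1000))
for n_evals, (q1, med, q3) in summary.items():
    print(f"{n_evals:5d} evaluations: quartiles {q1:.2e} {med:.2e} {q3:.2e}")
```

If the interquartile ranges of two settings overlap substantially, the difference between them is probably noise.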


12.7.2 Efficient implementations 


Heuristics allow us to solve models that do not fulfill the requirements of classical optimization 
methods. In a sense, they achieve this by giving up mathematical sophistication. For instance, they 
do not compute optimal search directions like a gradient search, but often choose randomly. We 
gain a lot with heuristics: we are freed of many of the assumptions and requirements of classical 
optimization techniques like convexity, but at the price of more computational effort. 

Thus, heuristics are computationally intensive. Computational effort of an algorithm can be 
measured in a platform-independent way by the number of objective function evaluations it takes 
to provide an acceptable solution. This number is often large when compared with the number of 
evaluations required by a classical method. But such a comparison would be deceiving if we did not 
pay attention to solution quality: if classical methods are not appropriate, we obtain a fast solution 
by giving up solution quality (because we are likely to get stuck in local minima). This is again the 
trade-off between the time required to obtain a solution, and the (numerical) quality of this solution. 

It immediately follows that we should try to implement the components of the model efficiently. 
We need to think about how we represent the data, and, in particular, we should spend some time on 
how we compute the objective function. MATLAB and R offer several utilities to measure computing 
time, from simple “stop watches” (tic and toc in MATLAB; system.time in R) to tools 
for profiling code (see profile in MATLAB; or Rprof in R). 

Speeding up computation is not a mechanical task; it requires measuring (always! don’t guess!), 
experimentation and testing. We can make a computation faster by exploiting structure in a problem 
(i.e., already in the model), but also by choosing the right way to compute a quantity. In particular, 
when we work with high-level languages like MATLAB and R, operation count may give hints, but 
cannot replace experiments. As a concrete example, we look into different possibilities to compute 
moving averages in R. 

Assume we have a time series of observations $y_t$ for $t = 1, \ldots, N$. We define a moving average 
of order $K$ as 

$$M_t = \frac{1}{K} \sum_{k=1}^{K} y_{t-k+1} . \tag{12.1}$$

Note that our computation of $M$ at $t$ includes the value of $y$ at $t$. The most basic idea is to loop over 
$y$, and at each $t$ compute the mean of $y$ over the past $K$ observations. This is shown in the script 
below. 


> N <- 1000 ## length of time series 
> K <- 50 ## order of moving average 
> y <- rnorm(N) 
> trials <- 100 
> library("rbenchmark") 
> MA_mean <- function(y, K) { 
      N <- length(y) 
      ans <- numeric(N) 
      for (t in K:N) 
          ans[t] <- mean(y[(t-K+1):t]) 
      ans 
  } 
> benchmark(MA_mean(y, K), 
            replications = 500, 
            order = "relative")[, 1:4] 

           test replications elapsed relative 
1 MA_mean(y, K)          500 [illegible]        1 

Using sum and dividing by K is already faster: 


> ## variant 2 -- compute mean 
> MA_sum <- function(y, K) { 
      N <- length(y) 
      ans <- numeric(N) 
      for (t in K:N) 
          ans[t] <- sum(y[(t-K+1):t])/K 
      ans 
  } 
> all.equal(MA_mean(y, K), MA_sum(y, K)) 

[1] TRUE 

> benchmark(MA_mean(y, K), 
            MA_sum(y, K), 
            replications = 500, 
            order = "relative")[, 1:4] 

           test replications elapsed relative 
2  MA_sum(y, K)          500   0.335     1.00 
1 MA_mean(y, K)          500   1.525     4.55 
This perhaps surprising result can be explained by the fact that sum is a primitive function, whereas 
mean is a normal function; actually a generic function. A lot of the overhead could be removed by 
directly calling the default method for mean. 


> MA_mean_default <- function(y, K) { 
      N <- length(y) 
      ans <- numeric(N) 
      for (t in K:N) 
          ans[t] <- mean.default(y[(t-K+1):t]) 
      ans 
  } 
> all.equal(MA_mean(y, K), MA_mean_default(y, K)) 

[1] TRUE 

> benchmark(MA_mean(y, K), 
            MA_sum(y, K), 
            MA_mean_default(y, K), 
            replications = 500, 
            order = "relative")[, 1:4] 

                   test replications elapsed relative 
2          MA_sum(y, K)          500   0.338     1.00 
3 MA_mean_default(y, K)          500   0.707     2.09 
1         MA_mean(y, K)          500   1.507     4.46 


There is actually structure in how we compute a moving average. We have 

$$M_t = \frac{y_t}{K} + \frac{y_{t-1}}{K} + \frac{y_{t-2}}{K} + \cdots + \frac{y_{t-K+1}}{K} . \tag{12.2}$$

We can shift $M_t$ to $t + 1$ and “add a zero” to get 

$$M_{t+1} = \frac{y_{t+1}}{K} + \underbrace{\frac{y_t}{K} + \frac{y_{t-1}}{K} + \cdots + \frac{y_{t-K+1}}{K}}_{M_t} - \frac{y_{t-K+1}}{K} \tag{12.3}$$

or 

$$M_{t+1} = M_t + \frac{y_{t+1}}{K} - \frac{y_{t-K+1}}{K} . \tag{12.4}$$

So we add one term and we delete one term; there is no need to recompute the whole sum at each 
time step.³ In R: 


> MA_update <- function(y, K) { 
      N <- length(y) 
      ans <- numeric(N) 
      ans[K] <- sum(y[1:K])/K 
      for (t in (K+1):N) 
          ans[t] <- ans[t-1] + y[t]/K - y[t-K]/K 
      ans 
  } 
> all.equal(MA_mean(y, K), MA_update(y, K)) 

[1] TRUE 


3. Curiously, a number of authors and operators prefer exponential moving averages because they can be computed by 
updating; see, for instance, Schwager, 1993, pp. 158-159, or Dacorogna et al., 2001, Chapter 3. 


> benchmark(MA_mean(y, K), 
            MA_sum(y, K), 
            MA_mean_default(y, K), 
            MA_update(y, K), 
            replications = 500, 
            order = "relative")[, 1:4] 

                   test replications elapsed relative 
4       MA_update(y, K)          500   0.056     1.00 
2          MA_sum(y, K)          500   0.330     5.89 
3 MA_mean_default(y, K)          500   0.692    12.36 
1         MA_mean(y, K)          500   1.502    26.82 

These variants still require loops. Two functions come to mind that would allow a vectorized 
computation: filter from R’s stats package (a moving average is a linear filter), and cumsum. 
These functions also require loops but execute them at a lower level, usually in C code. 


> MA_filter <- function(y, K) 
      filter(y, rep(1/K, K), sides = 1) 
> all.equal(MA_mean(y, K)[K:length(y)], 
            MA_filter(y, K)[K:length(y)]) 

[1] TRUE 

> benchmark(MA_mean(y, K), 
            MA_sum(y, K), 
            MA_mean_default(y, K), 
            MA_update(y, K), 
            MA_filter(y, K), 
            replications = 500, 
            order = "relative")[, 1:4] 

                   test replications elapsed relative 
4       MA_update(y, K)          500   0.057     1.00 
5       MA_filter(y, K)          500   0.065     1.14 
2          MA_sum(y, K)          500   0.338     5.93 
3 MA_mean_default(y, K)          500   0.693    12.16 
1         MA_mean(y, K)          500   1.510    26.49 


cumsum seems fastest by far. 


> MA_cumsum <- function(y, K) { 
      N <- length(y) 
      ans <- cumsum(y)/K 
      ans[K:N] <- ans[K:N] - c(0, ans[1:(N-K)]) 
      ans 
  } 
> all.equal(MA_mean(y, K)[K:length(y)], 
            MA_cumsum(y, K)[K:length(y)]) 

[1] TRUE 

> benchmark(MA_mean(y, K), 
            MA_sum(y, K), 
            MA_mean_default(y, K), 
            MA_update(y, K), 
            MA_filter(y, K), 
            MA_cumsum(y, K), 
            replications = 500, 
            order = "relative")[, 1:4] 

                   test replications elapsed relative 
6       MA_cumsum(y, K)          500   0.007     1.00 
4       MA_update(y, K)          500   0.056     8.00 
5       MA_filter(y, K)          500   0.066     9.43 
2          MA_sum(y, K)          500   0.335    47.86 
3 MA_mean_default(y, K)          500   0.697    99.57 
1         MA_mean(y, K)          500   1.521   217.29 


Still, function MA_sum suggests a useful principle. Well-written R functions typically do much 
more than the raw computation: they check and sometimes repair inputs, and they often have to 
specifically deal with missing values or invalid data. But when we do optimization, we usually 
have full control over the data and inputs, and so we may often improve performance by writing 
bare-bones implementations of particular functions. An example might be the computation of 
correlation for a fairly large matrix. 


> nc <- 200 
> x <- matrix(rnorm(2500*nc), ncol = nc) 
> benchmark(cor(x), 
            crossprod(scale(x)), 
            replications = 50, 
            order = "relative")[, 1:4] 

                 test replications elapsed relative 
2 crossprod(scale(x))           50   0.848     1.00 
1              cor(x)           50   2.530     2.98 

> all.equal(cor(x), 
            crossprod(scale(x))/(nrow(x)-1)) 

[1] TRUE 


As we wrote elsewhere in this book: memorizing rules cannot replace experimentation. So please 
do not memorize the examples we have shown here; instead, remember to experiment. 
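
The same experiment is easy to repeat in other environments. The following Python sketch (an illustration only; the function names are made up, and the timings will differ from the R results above) implements three of the variants, namely recomputing the sum at every step, updating as in (12.4), and differencing a cumulative sum, and times them with `timeit`. As in the R functions, the first K-1 entries are left at zero.

```python
import random
import timeit
from itertools import accumulate

def ma_loop(y, K):
    """Recompute the sum over the last K observations at every t."""
    n = len(y)
    ans = [0.0] * n
    for t in range(K - 1, n):
        ans[t] = sum(y[t - K + 1:t + 1]) / K
    return ans

def ma_update(y, K):
    """Update rule: add the newest term, drop the oldest one."""
    n = len(y)
    ans = [0.0] * n
    ans[K - 1] = sum(y[:K]) / K
    for t in range(K, n):
        ans[t] = ans[t - 1] + (y[t] - y[t - K]) / K
    return ans

def ma_cumsum(y, K):
    """Difference of cumulative sums that are K observations apart."""
    n = len(y)
    c = list(accumulate(y))
    ans = [0.0] * n
    ans[K - 1] = c[K - 1] / K
    for t in range(K, n):
        ans[t] = (c[t] - c[t - K]) / K
    return ans

def time_variants(n=1000, K=50, repeats=20, seed=0):
    """Elapsed seconds per variant for `repeats` calls on the same data."""
    rng = random.Random(seed)
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return {f.__name__: timeit.timeit(lambda: f(y, K), number=repeats)
            for f in (ma_loop, ma_update, ma_cumsum)}
```

Calling `time_variants()` returns a dictionary of elapsed times; which variant wins, and by how much, should be measured on the machine at hand rather than assumed.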


12.7.3 Parameter settings 


Boyd and Vandenberghe (2004, p. 5, emphasis theirs) call a method for solving a particular problem 
“a (mature) technology, [if it] can be reliably used by many people who do not know, and do 
not need to know, the details.” As an example of such a technology, they suggest Least Squares. 
Heuristics, today, are no such technology.⁴ (But to be fair, it is pretty hard to come up with methods 
other than Least Squares that qualify as technologies in this sense, and even for Least Squares 
probably not everyone would agree.) 

A difficulty is the number of decisions that we have to take. For single-solution methods, for 
instance, we need to think about how to choose a neighborhood; for Genetic Algorithms, we need 
to decide what mutation operators we include, and so on. If heuristics are to become a mature 
technology, an important question is not which parameter values are optimal for a given problem 
class (or more often, a single instance of a problem), but how sensitive a method’s solutions are 
to specific choices. To put it the other way around, we would like to know classes of problems 
for which a given method with a given implementation works robustly across different parameter 
settings. 


4. Optimization always requires testing, evaluation, checking for robustness. This is not just the case for heuristics, but for 
classical methods as well. 


12.8 Outlook 


This chapter has given a brief overview of different heuristic methods. The presentation has not 
been complete; it could not even touch on the innumerable variations of the discussed techniques. 
Different methods can also be combined into hybrid techniques, further increasing the number of 
available methods. This actually leads to a dilemma. On the one hand, essentially all heuristics 
build on only a few, often very simple, principles, such as accepting inferior solutions. If we only 
want to understand how heuristics work, these principles are enough. Our main goal in this chapter 
has been motivation, so we have tried to keep the discussion simple and did not start with a detailed 
taxonomy of different methods or insist that a minor modification of some method is a new technique. 
It is often difficult to even clearly define when a particular algorithm is considered a method 
of its own, and not just a variant. Such a distinction is not only driven by the actual characteristics 
of a technique, but also in a path-dependent way by how well researchers and operators accept a 
given idea. Threshold Accepting, for instance, changes only one element in Simulated Annealing, 
the acceptance criterion. Strictly speaking, it does not even change it, but uses a special case. Still, 
the method has been accepted. Moscato and Fontanari (1990) suggested the same mechanism, but 
did not consider the technique a new method. Another example is Memetic Algorithms (Moscato, 
1989). A Memetic Algorithm develops a population of solutions, but each member of the population 
also searches individually via a single-solution method. The solutions then cooperate (e.g., 
by crossover), but also compete (good solutions replace inferior ones). In essence, this is a hybrid 
of a population-based procedure and a single-solution method; still, the algorithm is accepted as a 
single technique. 

But on the other hand, the principles on which heuristics are built are so general (or vague) 
that different methods based on the same principles will often perform very differently for given 
problems. It becomes useful then to define specific techniques more strictly in terms of their characteristics 
and “stick to the algorithm” in order to establish a common language. For instance, with 
the definition of direct search given earlier (page 242), all methods discussed in this chapter could 
be called direct search methods. But the term direct search should be used for methods like Nelder–Mead 
not because of pedantry, but because it facilitates communication between researchers and 
practitioners. As pointed out above, how well a method is accepted depends not only on its merits, 
but also on how knowledge about the method diffuses throughout communities of researchers and 
practitioners. In order to promote the proliferation of heuristic methods, it should help to have clear 
algorithms in mind, not just rough principles. Repeated successes of a method to solve problems 
of a given class (or even several classes) should then provide us with a picture of the strengths 
and weaknesses of the technique, and also increase the confidence that operators and researchers 
have in the method and, hence, its acceptance. This book aims to contribute to the acceptance 
of heuristics in finance by demonstrating their capability to solve optimization problems that are 
infeasible for classical methods. The first appendix to this chapter will show several examples of 
how different heuristics can be implemented in MATLAB. A second appendix looks into how to 
do parallel computations with MATLAB,⁵ which is particularly useful for experiments. A third 
appendix then showcases the heuristics that are implemented in the NMOF package. The remaining 
chapters will discuss the implementation of heuristics for specific applications from portfolio 
selection, option pricing, econometric models, and other fields. 


Appendix 12.A Implementing heuristic methods with MATLAB 


The following implementations are simple general translations of the presented algorithms. We 
have deliberately kept these implementations simple so that they may guide readers in writing 
their own code.⁶ The workings of the codes are illustrated with two particular problems, namely the 
minimization of a continuous function (the Shekel function) and an integer problem (the subset 
sum problem). Some problems will be solved with different heuristics; this is not to be understood 
as a race, however, since the performance of a specific heuristic depends on the problem type. 
What should be considered when choosing a particular heuristic is the ease with which the objective 
function and related constraints can be coded, as well as one’s personal expertise with the chosen 
method. 


5. Parallel computations with R are discussed in Section 15.C. 

The problems are presented first along with the coding of the function to be minimized and the 
scripts coding the related data. In the subsequent section, the code of the algorithms is given and 
problems are solved. The code of the algorithms is not problem specific. 


The executions are started with a script including problem data, algorithm-specific parameters 
and a call to the heuristic. The name of the script is a combination of the abbreviation of the 
heuristic and the name of the data, cf. bipartite graph. For instance, an edge connecting the node 
GA with the node SSP means that GASSP.m is the script solving the Subset Sum Problem (SSP) 
with the Genetic Algorithm (GA). 

[Figure: bipartite graph connecting the heuristics (e.g., TA, GA, DE) with the problems (Shekel, SSP).]


The code for all the algorithms is organized in similar sequences of scripts and functions. The 
solutions are stored in the variables Sol and X and are printed by the scripts ResRealVec.m 
for the continuous case, and ResLogVec.m for the integer problem. They list the solutions corresponding 
to the successive restarts. For each solution, the percentage of steps/generations explored 
when searching for the best solution is also given. The script plotRes.m provides graphics which 
inform about the working of the algorithm. In particular it shows the empirical distribution of the 
solutions for the restarts, the threshold sequence and the evolution of the objective function values 
along the rounds and steps (or generations) for all restarts. The best value over the rounds (or 
generations) is marked with a circle. 


Listing 12.1: C-HeuristicsNutshell/M/./Ch12/ResRealVec.m 


% ResRealVec.m -- version 2018-11-23 
fprintf(' Elapsed time %4.1f sec\n',t1); 
[ign,ibest] = min(Sol); n = size(X,1); 
if exist('TA','var'), P='R'; m=nR*nS; else P='G'; end 
fprintf(' ------------------------ \n') 
for i = 1:nRe 
    if i==ibest, tag = '*'; else tag = ' '; end 
    if strcmp(P,'R')   % -- Rounds 
        j = find(~isnan(JB(:,i)),1,'last'); 
        p = ceil(100*( JB(j,i) / m )); 
    else               % -- Generations 
        p = ceil(100*( JB(i) / nG )); 
    end 
    fprintf('%s Sol = %6.2f %3i%%  x = [',tag,Sol(i),p); 
    for k = 1:n, fprintf('%5.2f ',X(k,i)); end; 
    fprintf('\b]\n'); 
end, fprintf(' ------------------------- \n'); S = Sol; 


Listing 12.2: C-HeuristicsNutshell/M/./Ch12/ResLogVec.m 


% ResLogVec -- version 2018-11-23 
fprintf(' Elapsed time %4.1f sec\n',t1); 
if exist('TA','var'), P='R'; m=nR*nS; else P='G'; end 
w = Data.w; s = Data.s; 
fprintf(' ------------------------ ') 
for i = 1:nRe 
    J = find(X(:,i)); sol = sum(w(J)); nJ = numel(J); 
    if abs(sol - s), tag = ' '; else tag = '*'; end 
    if strcmp(P,'R')   % -- Rounds 
        j = find(~isnan(JB(:,i)),1,'last'); 
        p = ceil(100*( JB(j,i) / m )); 
    else               % -- Generations 
        p = ceil(100*( JB(i) / nG )); 
    end 
    fprintf('\n%s Sol = %4i %3i%% %3i el.:',tag,sol,p,nJ); 
    for k = 1:nJ, fprintf('%4i',w(J(k))); end; 
end, fprintf('\n ------------------------- \n'); S = Sol; 


6. An R implementation with more-detailed discussion can be found in Chapter 14. 


Listing 12.3: C-HeuristicsNutshell/M/./Ch12/plotRes.m 


% plotRes.m -- version 2018-11-23 
figure(1), subplot(211)       % plot CDF of solutions 
H = cdfplot(Sol); title('Empirical CDF of solutions') 
xlabel(''); ylabel(''); set(gca,'FontSize',15) 
set(H,'Color',.7*[1 0 0],'LineWidth',3); 
if exist('ND','var') 
    figure(1), subplot(212)   % plot threshold sequence 
    H = cdfplot(ND); xlabel(''); ylabel(''); hold on 
    plot(TA.th,TA.Percentiles,'ko','MarkerSize',8); 
    set(H,'Color',.7*[1 1 1],'LineWidth',2) 
    title('Emp. distr. of distances between neighbor sol.'); 
    set(gca,'ytick',fliplr(TA.Percentiles),'FontSize',15); 
    set(gca,'xtick',fliplr(TA.th)); xlim([0 TA.th(1)]); 
end 
if strcmp(P,'R')   % Rounds plotted 
    F = FF; J = JB; nR = size(J,1); 
    nC = size(F,1); nS = nC/nR; P = 'Rounds'; 
else               % Generations plotted 
    F = FG; P = 'Generations'; 
end 
[nC,nRe] = size(F); Fmin = min(F(:)); Fmax = max(F(:)); 
for r = 1:nRe 
    j = num2str(mod(r-1,2)+1); 
    figure(fix((r+1)/2)+1), eval(['subplot(21',j,');']); 
    if strcmp(P,'Rounds')   % -- Rounds 
        plot(F(:,r),'k-'),hold on, K = J(~isnan(J(:,r)),r); 
        plot(K,F(K,r),'ko','MarkerSize',8,... 
            'MarkerFaceColor','w'); 
        for k = 1:nR-1 
            plot(nS*[k k],[Fmin Fmax],'k:','LineWidth',2); 
        end 
    else                    % -- Generations 
        plot(F(:,r),'ko','MarkerSize',8,... 
            'MarkerFaceColor',0.7*[1 1 1]); 
    end 
    set(gca,'FontSize',15); xlim([0 nC]);ylim([Fmin Fmax]); 
    xlabel([P,' at restart ',num2str(r)]); grid on; 
    if strcmp(P,'Rounds'), set(gca,'xtick',nS*(1:nR)); end 
end 


12.A.1 The problems 


The names of the MATLAB functions used to code the objective functions corresponding to the 
minimization of the following problems will be passed to the algorithms through the variable OF. 
The number and order of input arguments will be (x0,Data), the first being the variable for 
which we evaluate the function and a second argument containing information about the function 
parameters. 
FIGURE 12.2 Shekel function. For better visibility $-f$ has been plotted. 


Shekel function 


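For the given matrix $A$ and vector $c$ 

$$A = \begin{bmatrix} 4 & 1 & 8 & 6 & 3 & 2 & 5 & 8 & 6 & 7 \\ 4 & 1 & 8 & 6 & 7 & 9 & 5 & 1 & 2 & 3.6 \\ 4 & 1 & 8 & 6 & 3 & 2 & 3 & 8 & 6 & 7 \\ 4 & 1 & 8 & 6 & 7 & 9 & 1 & 3 & 2 & 3.6 \end{bmatrix}$$

$$c = \begin{bmatrix} 0.1 & 0.2 & 0.2 & 0.4 & 0.4 & 0.6 & 0.3 & 0.7 & 0.5 & 0.5 \end{bmatrix}$$

and for $x \in [0, 10]^n$, $1 \le n \le 4$, and $1 \le m \le 10$, the Shekel function 

$$f(x) = -\sum_{j=1}^{m} \Big( c_j + \sum_{i=1}^{n} (x_i - a_{ij})^2 \Big)^{-1} \tag{12.5}$$

has $m$ local minima. The global minimum is at $x_i = 4$, where $i = 1, \ldots, n$, whose value is close 
to $-11.03$ for $m = 10$ and $n = 2$. Fig. 12.2 shows the shape of the function which is coded in 
Shekel.m, 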


Listing 12.4: C-HeuristicsNutshell/M/./Ch12/Shekel.m 


function F = Shekel(x,Data) 
% Shekel.m -- version 2018-11-05 
n = Data.n; m = Data.m; A = Data.A; c = Data.c; F = 0; 
for j = 1:m 
    S = 0; 
    for i = 1:n 
        S = S + (x(i) - A(i,j))^2; 
    end 
    F = F - 1/(c(j) + S); 
end 


while the script ShekelData.m contains the related data. 


Listing 12.5: C-HeuristicsNutshell/M/./Ch12/ShekelData.m 


n = 2; m = 10; int = repmat([0 10],n,1); % domain 
A = [4 1 8 6 3 2 5 8 6 7; 4 1 8 6 7 9 5 1 2 3.6; 
     4 1 8 6 3 2 3 8 6 7; 4 1 8 6 7 9 1 3 2 3.6]; 
c = [0.1 0.2 0.2 0.4 0.4 0.6 0.3 0.7 0.5 0.5]; 
Data = struct('A',A,'c',c,'int',int,'n',n,'m',m); 
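
As a cross-check of the definition in (12.5), the function can be evaluated at the point x = (4, 4). The following Python sketch is a direct translation of Shekel.m with the data of ShekelData.m (the Python names are chosen for this illustration); the dimension n is taken from the length of x.

```python
# Data as in ShekelData.m
A = [[4, 1, 8, 6, 3, 2, 5, 8, 6, 7],
     [4, 1, 8, 6, 7, 9, 5, 1, 2, 3.6],
     [4, 1, 8, 6, 3, 2, 3, 8, 6, 7],
     [4, 1, 8, 6, 7, 9, 1, 3, 2, 3.6]]
C = [0.1, 0.2, 0.2, 0.4, 0.4, 0.6, 0.3, 0.7, 0.5, 0.5]

def shekel(x, m=10):
    """Shekel function (12.5); n is the length of x (1 <= n <= 4)."""
    n = len(x)
    total = 0.0
    for j in range(m):
        dist2 = sum((x[i] - A[i][j]) ** 2 for i in range(n))
        total -= 1.0 / (C[j] + dist2)
    return total

print(round(shekel([4.0, 4.0]), 2))  # close to -11.03
```

The first term alone contributes -1/0.1 = -10 at this point, which is why the minimum near (4, 4) is so much deeper than the other local minima.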
The subset sum problem 


As an illustration of an integer problem, we solve the subset sum problem where, given the 
set of integers $Q$, we have to find a subset $J$ of elements whose sum is $s$. For example, given 
$Q = \{-9, -5, -2, 3, 4, 7\}$ the subset with sum $s = 8$ is $J = \{-2, 3, 7\}$. The partitioning problem 
is a special case of the subset sum problem, which in turn is a special case of the knapsack problem.⁷ 

The function SSP.m codes the objective function to be minimized. It is a non-negative function 
$\left| \sum_{j \in J} w_j - s \right|$ which is zero if the sum of the elements of the set $J$ equals $s$. 

SSPData.m contains the related data, i.e. a randomly generated array w of integers in the 
range [1, 900] of length n = 500 and s = 1000. 


Listing 12.6: C-HeuristicsNutshell/M/./Ch12/SSP.m 


function F = SSP(J,Data) 
% Subset sum problem -- version 2018-08-26 
F = abs(sum(Data.w(J)) - Data.s); 


Listing 12.7: C-HeuristicsNutshell/M/./Ch12/SSPData.m 


n = 500; w = randperm(900,n); s = 1000; 
Data = struct('w',w,'n',n,'s',s); 
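
For a very small instance the objective can be checked by hand, and even minimized by enumeration. The Python sketch below is an illustration only (the names `ssp` and `solve_exhaustive` are made up here): it represents a subset as a logical vector, as the MATLAB code does, and evaluates the example from the text. Enumeration visits up to 2^n subsets, which is hopeless for n = 500 and is the reason for using a heuristic.

```python
from itertools import product

def ssp(selected, w, s):
    """Objective |sum of selected weights - s|; zero means a solution."""
    return abs(sum(wj for wj, inc in zip(w, selected) if inc) - s)

def solve_exhaustive(w, s):
    """Try all 2^n logical vectors; only feasible for very small n."""
    best = min(product([False, True], repeat=len(w)),
               key=lambda sel: ssp(sel, w, s))
    return list(best)

w = [-9, -5, -2, 3, 4, 7]
print(ssp([False, False, True, True, False, True], w, 8))  # {-2, 3, 7} -> 0
```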


12.A.2 Threshold Accepting 


The file dependencies of the code for the Threshold Accepting (TA) algorithm are shown in the 
following graph. ‘xxx’ stands for the problem names, i.e. Shekel or SSP. 


[Figure: file dependency graph with the nodes TAxxx, xxxData, goTA, SS, thSequence, TAH, OF, NF, ResRealVec/ResLogVec, and plotRes.]


The names of the user-provided problem-specific functions for computing the starting solutions SS, 
the neighbor solutions NF as well as the objective function OF are defined in TAxxx.m. 

Again, in order to be compatible with the calls to these functions in the code, the input arguments 
have to be (n,nRe,OF,Data) for the function SS and (x0,j,Data,TA) for the function NF, 
where n is the dimension of the variable, nRe the number of restarts, x0 the current solution vector 
and j the index of the element in the current solution which is involved in the definition of a new 
neighbor solution. These indices are generated at once before the calls to the function. If some of the 
arguments are not used inside the functions they are named ignore. As to the output arguments, 
the function OF returns the function value and NF the coordinates of the generated neighbor. 
The function SS returns $F_{nRe \times 1}$ and $X_{n \times nRe}$, the function values and corresponding coordinates, as 
well as Fbest and rbest, the value and index of the best solution. These two variables are not 
used by all algorithms. 

The script goTA.m calls the functions computing the starting solution SS, thSequence for 
the computation of the threshold sequence and TAH containing the core of the algorithm. 


Listing 12.8: C-HeuristicsNutshell/M/./Ch12/goTA.m 


% goTA.m -- version 2018-11-23 
if ~exist('scale','var'), scale = 0.2; end 
if ~exist('ptrim','var'), ptrim = 0.8; end 
if ~exist('upctl','var'), upctl = 0.9; end 
TA = struct('OF',OF,'SS',SS,'NF',NF,... 
    'Restarts',nRe,'Rounds',nR,'Steps',nS,... 
    'scale',scale,'ptrim',ptrim,'upctl',upctl); 
FF = NaN(nR*nS,nRe); JB = NaN(nR,nRe); n = Data.n; 
Sol = NaN(nRe,1); X = NaN(Data.n,nRe); t0 = tic; 
[Fs,xs] = feval(SS,n,nRe,OF,Data); % Starting solutions 
if ~exist('th','var') 
    TA.Percentiles = linspace(upctl,0,nR); 
    [TA.th,ND] = thSequence(Fs(1),xs(:,1),TA,Data); 
else TA.th = th; ND = []; end 
for r = 1:nRe 
    [Sol(r),X(:,r),FF(:,r),JB(:,r)] = TAH(TA,Data,... 
        Fs(r),xs(:,r)); 
end, t1 = toc(t0); 


7. Chapter 14 provides a more detailed discussion of the problem and an R implementation. 


Listing 12.9: C-HeuristicsNutshell/M/./Ch12/thSequence.m 


function [th,ND] = thSequence(F0,x0,TA,Data) 
% thSequence.m -- version 2018-11-23 
OF = TA.OF; NF = TA.NF; nND = min(TA.Rounds*TA.Steps,5000); 
ND0 = zeros(1,nND); JJ = unidrnd(Data.n,nND,1); 
for s = 1:nND 
    x1 = feval(NF,x0,JJ(s),Data,TA); 
    F1 = feval(OF,x1,Data); 
    ND0(s) = F1 - F0; F0 = F1; x0 = x1; 
end 
ND = sort(abs(ND0)); ip = find(ND > 0,1); 
ntrim = max(fix((nND-ip)*TA.ptrim),10); 
ND = ND(ip:end-ntrim); 
th = quantile(ND,TA.Percentiles); th(TA.Rounds) = 0; 


Listing 12.10: C-HeuristicsNutshell/M/./Ch12/TAH.m 


function [Fbest,xbest,FF,JB] = TAH(TA,Data,F0,x0) 
% TAH.m -- version 2018-10-26 
nR = TA.Rounds; nS = TA.Steps; OF = TA.OF; NF = TA.NF; 
FF = NaN(nR*nS,1); JB = NaN(nR,1); JB(1) = 1; 
Fbest = F0; xbest = x0; FF(1) = F0; j = 0; 
for iround = 1:nR 
    J = unidrnd(Data.n,nS,1); 
    for istep = 1:nS 
        x1 = feval(NF,x0,J(istep),Data,TA); 
        F1 = feval(OF,x1,Data); j = j + 1; FF(j) = F1; 
        if F1 <= F0 + TA.th(iround) 
            F0 = F1; x0 = x1; 
            if F1 < Fbest 
                Fbest = F1; xbest = x1; JB(iround) = j; 
            end 
        end 
    end 
end 
The structure TA in goTA.m holds the parameters of the TA algorithm which have been specified in 
TAxxx.m. There are three preset parameters, scale, ptrim and upctl, whose values may occasionally 
be redefined within TAxxx.m. The parameter scale defines the scaling of the random 
variable used to perturb the current solution for the computation of a neighborhood in continuous 
problems. The other two parameters guide the computation of the threshold sequence. ptrim 
defines the proportion of the sequence of sorted neighborhood distances to be trimmed in the upper 
tail while upctl is the upper percentile of the distribution of neighbor distances which defines the 
value of the first threshold. If the first threshold appears to be too large, one can reduce its value. In 
general, these parameters are robust with respect to different optimizations. 

Notice that it is also possible to specify a particular threshold sequence by declaring in 
TAxxx.m an array th of length nR with the decreasing thresholds. 
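
The interplay of threshold sequence and acceptance rule may be easier to see in a condensed form. The following Python sketch is a simplified illustration, not a translation of goTA.m, thSequence.m and TAH.m: the trimming and quantile rules are cruder than in the MATLAB code, the neighbor function receives the random number generator instead of a precomputed index, and the one-dimensional test function in the usage lines is an assumption made for the example.

```python
import random

def threshold_sequence(f, neighbor, x0, n_rounds, rng,
                       n_samples=1000, upctl=0.9, ptrim=0.8):
    """Thresholds from quantiles of |f-changes| along a random walk."""
    x, fx = x0, f(x0)
    diffs = []
    for _ in range(n_samples):
        x1 = neighbor(x, rng)
        f1 = f(x1)
        diffs.append(abs(f1 - fx))
        x, fx = x1, f1
    diffs = sorted(d for d in diffs if d > 0)
    diffs = diffs[:max(1, int(len(diffs) * (1.0 - ptrim)))]  # trim upper tail
    if not diffs:
        return [0.0] * n_rounds
    thresholds = []
    for k in range(n_rounds):
        p = upctl * (1.0 - k / max(1, n_rounds - 1))  # from upctl down to 0
        thresholds.append(diffs[int(p * (len(diffs) - 1))])
    thresholds[-1] = 0.0  # the last round accepts no deterioration
    return thresholds

def threshold_accepting(f, neighbor, x0, thresholds, n_steps, rng):
    """Accept a neighbor unless it is worse by more than the threshold."""
    x, fx = x0, f(x0)
    best, fbest = x, fx
    for th in thresholds:
        for _ in range(n_steps):
            x1 = neighbor(x, rng)
            f1 = f(x1)
            if f1 <= fx + th:
                x, fx = x1, f1
                if f1 < fbest:
                    best, fbest = x1, f1
    return best, fbest

# usage on a toy problem: minimize (x - 3)^2 on [0, 10]
rng = random.Random(42)
f = lambda x: (x - 3.0) ** 2
step = lambda x, rng: min(10.0, max(0.0, x + 0.2 * rng.gauss(0.0, 1.0)))
th = threshold_sequence(f, step, 9.0, n_rounds=6, rng=rng)
best, fbest = threshold_accepting(f, step, 9.0, th, n_steps=2000, rng=rng)
```

With all thresholds set to zero, the procedure reduces to a local search that also accepts moves of equal quality.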
Minimization of Shekel function 


The script TAShekel.m reads the data, declares the names of the objective function OF, the 
function NF for the computation of the neighborhood and the function SS, which computes the 
starting solution, the number of restarts, rounds and steps and, optionally, the parameters scale, 
ptrim, upctl and th. 


Listing 12.11: C-HeuristicsNutshell/M/./Ch12/TAShekel.m 


ShekelData, OF = 'Shekel'; SS = 'SrealVec'; NF = 'CNeighbor'; 
nRe = 5; nR = 6; nS = 8000; 
goTA 
TA, ResRealVec, plotRes 


The function SrealVec.m computes the starting solutions for all nRe restarts by randomly generating 
the components of a vector lying in the domain of the function. It will also be used for 
computing the starting populations in the Differential Evolution (DE) and Particle Swarm (PS) 
algorithms. 


Listing 12.12: C-HeuristicsNutshell/M/./Ch12/SrealVec.m 


function [F,X,Fbest,rbest] = SrealVec(n,nR,OF,Data) 
% SrealVec.m -- version 2018-11-01 (Starting sol./pop.) 
F = zeros(nR,1); X = zeros(n,nR); int = Data.int; 
Fbest = realmax; 
for r = 1:nR 
    for i = 1:n 
        lowlim = int(i,1); uplim = int(i,2); 
        X(i,r) = lowlim + (uplim - lowlim) * rand; 
    end 
    % evaluate once, after all n components have been drawn 
    F(r) = feval(OF,X(:,r),Data); 
    if F(r) < Fbest, Fbest = F(r); rbest = r; end 
end 


Neighborhood solutions are computed with the function CNeighbor.m which adds a normally 
distributed random variable scaled by the factor scale to a randomly selected component of the 
current solution x0. The new value is truncated if not in the domain of definition. The input arguments 
are as explained before. Notice that the sequences of random variables used in the successive 
calls of the function are generated and stored in an array with a single instruction before the call to 
the function in order to improve efficiency. 


Listing 12.13: C-HeuristicsNutshell/M/./Ch12/CNeighbor.m 


function x1 = CNeighbor(x0,j,Data,TA) 
% CNeighbor.m -- version 2018-10-25 
x1 = x0; 
% Randomly select element j of x0 
x1(j) = x1(j) + randn * TA.scale; int = Data.int; 
% Check domain constraints 
x1(j) = min(int(j,2), max(int(j,1),x1(j)) ); 


Fig. 12.3 shows the empirical distribution of the solutions. For this execution the minimum has 
been reached twice. 


>> TAShekel 
TA = 
             OF: 'Shekel' 
             SS: 'SrealVec' 
             NF: 'CNeighbor' 
       Restarts: 5 
         Rounds: 6 
          Steps: 8000 
          scale: 0.2000 
          ptrim: 0.8000 
          upctl: 0.9000 
    Percentiles: [0.9000 0.7200 0.5400 0.3600 0.1800 0] 
             th: [0.0068 0.0050 0.0034 0.0020 8.9719e-04 0] 
 Elapsed time 5.2 sec 

  Sol =  -2.90  92%  x = [ 6.96 3.59] 
  Sol = -11.03  96%  x = [ 4.00 4.00] 
  Sol =  -2.90  98%  x = [ 6.96 3.59] 
* Sol = -11.03  55%  x = [ 4.00 4.00] 
  Sol =  -5.37  90%  x = [ 8.00 8.00] 


FIGURE 12.3 Shekel function. Empirical distribution of solutions. 

Fig. 12.4 shows the threshold sequence and the evolution of the objective function value over
the 6 rounds of the 4th restart. The reported solution was identified in round 4 at around 26,000
steps (cf. circle). Notice that a sufficiently precise solution had already been found in round one.


FIGURE 12.4 Shekel function. Left panel: Threshold sequence. Right panel: Objective function values over the 6 rounds
of the 4th restart.


Solving the subset sum problem 


The script TASSP.m defines the names of the problem-specific functions (objective function OF,
starting solutions SS and neighborhood NF), sets the values of the TA parameters, and then calls a
script displaying the results.


Listing 12.14: C-HeuristicsNutshell/M/./Ch12/TASSP.m 


SSPData; OF = 'SSP'; SS = 'SlogVec'; NF = 'DNeighbor';
nRe = 5; nR = 6; nS = 8000; % th = [50 30 10 5 2 0];
goTA
TA, ResLogVec, plotRes


The function SlogVec.m generates the starting solutions by randomly adding elements to a set as
long as its sum is less than s.


Listing 12.15: C-HeuristicsNutshell/M/./Ch12/SlogVec.m


function [F,X,Fbest,rbest] = SlogVec(n,nR,OF,Data)
% SlogVec -- Version 2018-11-16
w = Data.w; s = Data.s;
X = false(n,nR); F = zeros(nR,1); Fbest = realmax;
for r = 1:nR
    J = randperm(n); i = 1;
    while F(r) < s
        j = J(i);
        F(r) = F(r) + w(j);
        X(j,r) = true;
        i = i + 1;
    end
    if F(r) < Fbest, Fbest = F(r); rbest = r; end
end
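As an illustration, the construction of one random starting subset can be sketched in Python as follows. The function name `random_subset_start` is hypothetical; unlike the MATLAB loop, this sketch also stops when the list of elements is exhausted, which is an added safeguard and not part of the original code.

```python
import random

def random_subset_start(w, s, rng=random):
    """Add elements in random order until the running sum reaches s."""
    order = list(range(len(w)))
    rng.shuffle(order)                  # random visiting order, like randperm
    x = [False] * len(w)
    total = 0
    for j in order:
        if total >= s:                  # sum is no longer below s: stop
            break
        x[j] = True
        total += w[j]
    return x, total

w = list(range(1, 101))                 # hypothetical weights 1..100
x, total = random_subset_start(w, 1000, random.Random(3))
```

By construction the resulting sum overshoots s by less than the largest weight, so the subsequent search starts close to the target.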


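A neighbor solution is generated by taking a randomly selected element of Ω and, with equal
probability, either adding it to or deleting it from the current solution. With this simple
procedure the neighbor may be identical to the current solution, for instance when the selected
element is already in the set and is added again.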


Listing 12.16: C-HeuristicsNutshell/M/./Ch12/DNeighbor.m 


function x1 = DNeighbor(x0,j,ignore1,ignore2)
% DNeighbor.m -- version 2018-10-24
x1 = x0;
if rand < 1/2, x1(j) = true; else x1(j) = false; end


The definition of neighborhoods considered can generate large jumps of the objective function in
the succeeding steps, which again results in high thresholds. One can override the algorithm
generating the thresholds and specify a particular sequence of thresholds instead (see the commented
instruction in TASSP.m).


>> TASSP
TA =
             OF: 'SSP'
             SS: 'SlogVec'
             NF: 'DNeighbor'
       Restarts: 5
         Rounds: 6
          Steps: 8000
          scale: 0.2000
          ptrim: 0.8000
          upctl: 0.9000
    Percentiles: [0.9000 0.7200 0.5400 0.3600 0.1800 0]
             th: [177.8000 148 121 79 40.4800 0]
Elapsed time 5.4 sec

* Sol = 1000  72%  16 el.:  95   9 168   1 164  49   2  29
                             7 193  54  36  75  17  79  22
* Sol = 1000  81%  18 el.:  47  15  40  52  49   2  69   7
                            88  59  67  31  54 123  42 148
                            85  22
  Sol =  999  61%  20 el.:  61  74  63  70   2  29  69   1
                            11  59  48  31  44  56  54  45
                            42  17  20 197
* Sol = 1000  52%  20 el.:  61   9  74  63  70 116   2  69
                            99   7  50  31  44  56 103  45
                            42  17  20  22
* Sol = 1000  55%  22 el.:  61  15  70   1  49  29  69  T4
                            50  48  67  31  44  54 110   8
                            45  42  75  20  79  22

Four subsets, out of the five computed, have the desired sum. The cardinality ranges from 16 to
22. Fig. 12.5 displays the threshold sequence and the evolution of the objective. Notice that for
this type of problem the value of the objective function at the minimum is known, i.e. zero, and
therefore we could stop the restarts as soon as such a solution has been found.


FIGURE 12.5 Subset sum problem. Left panel: Threshold sequence. Right panel: Objective function values over the 6
rounds of the 5th restart.
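Stopping the restarts once the known minimum has been reached could look like the following Python sketch. The wrapper `run_restarts` and the toy `fake_values` are hypothetical; `search` stands for any single run of a heuristic that returns an objective value and a solution.

```python
def run_restarts(search, n_restarts, target=0.0, tol=0.0):
    """Run up to n_restarts searches; stop once the known optimum is reached."""
    results = []
    for r in range(n_restarts):
        value, x = search(r)
        results.append((value, x))
        if abs(value - target) <= tol:  # known minimum reached: skip remaining restarts
            break
    return results

# Hypothetical search whose third restart hits the optimum value zero.
fake_values = [3.0, 1.0, 0.0, 2.0, 0.0]
results = run_restarts(lambda r: (fake_values[r], None), 5)
```

Here only three of the five restarts are executed. For problems with an unknown minimum the test has to be dropped, and all restarts are run.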


12.A.3 Genetic Algorithm 


For the Genetic Algorithm (GA), the file dependencies of the code are the following. As explained 
before, ’xxx’ stands for the problem names (in this case only SSP). 


GAxxx -> goGA -> GAH -> SP, OF
GAxxx -> ResLogVec
GAxxx -> plotRes


The names of the objective function OF and of SP, the function for the computation of the starting
population, have to be provided in the script GAxxx.m together with nC, the chromosome length,
and nRe, the number of restarts.

The set of parameters for the Genetic Algorithm is preset in goGA and stored in the structure
GA with the following fields:


nC    Chromosome length
nG    Number of generations
nP    Population size
nP1   Size of mating pool
nP2   Number of children generated
pM    Probability to undergo mutation


All these parameters may be overwritten by redefining them in GAxxx.


Listing 12.17: C-HeuristicsNutshell/M/./Ch12/goGA.m 


% goGA.m -- version 2018-11-13
if ~exist('nG','var'), nG = 80; end
if ~exist('pM','var'), pM = 0.2; end
if ~exist('nP','var'), nP = 200; end
if ~exist('nP1','var'), nP1 = 100; end
if ~exist('nP2','var'), nP2 = 200; end
GA = struct('OF',OF,'SP',SP,'nC',nC,'nG',nG,'nP',nP,...
    'nP1',nP1,'nP2',nP2,'pM',pM,'Restarts',nRe);
Sol = NaN(nRe,1); X = zeros(nC,nRe); t0 = tic;
FG = NaN(nG,nRe); JB = NaN(1,nRe);
for r = 1:nRe
    [Sol(r),X(:,r),FG(:,r),JB(r)] = GAH(GA,Data);
end, t1 = toc(t0);


The function GAH is the core of the Genetic Algorithm. It calls SP, the function for generating the 
starting population, generates the mating pool and the children and stores them in the structures P, 
P1 and P2 respectively. The field C holds the chromosomes while the field F is the corresponding 
fitness. 


Listing 12.18: C-HeuristicsNutshell/M/./Ch12/GAH.m 


 1| function [Fbest,xbest,FG,rbest] = GAH(GA,Data)
 2| % GAH.m -- version 2018-11-09
 3| nG = GA.nG; nP2 = GA.nP2; nC = GA.nC; OF = GA.OF; SP = GA.SP;
 4| n = Data.n; nP = GA.nP; FG = NaN(1,nG);
 5| [F,X,Fbest] = feval(SP,n,nP,OF,Data);
 6| P = struct('F',F','C',X'); P2.C = false(nP2,nC);
 7| for r = 1:nG
 8|     P1 = MatingPool(GA,P);
 9|     for i = 1:nP2
10|         P2.C(i,:) = Crossover(GA,P1);
11|         P2.C(i,:) = Mutate(GA,P2.C(i,:));
12|     end
13|     P = Survive(GA,P1,P2,OF,Data);
14|     if P.F(1) < Fbest,
15|         Fbest = P.F(1); xbest = P.C(1,:); rbest = r;
16|     end
17|     FG(r) = P.F(1);
18|     if (Fbest == 0) || all(~diff(P.F)), break, end %--stop
19| end


The four key tasks—selection of a mating pool, crossover, mutation, and selection of survivors—are 
coded in the following separate functions. 


Listing 12.19: C-HeuristicsNutshell/M/./Ch12/MatingPool.m 


function P1 = MatingPool(GA,P)
% MatingPool.m -- version 2018-08-21
nP = GA.nP; nP1 = GA.nP1;
IP1 = unidrnd(nP,1,nP1);     % Set of indices defining P1
P1.C = P.C(IP1,:);
P1.F = P.F(IP1);


Listing 12.20: C-HeuristicsNutshell/M/./Ch12/Crossover.m


function C = Crossover(GA,P1)
% Crossover.m -- version 2018-08-21
nP1 = GA.nP1; nC = GA.nC;
par = unidrnd(nP1,2,1);      % Select parents
p1 = par(1); p2 = par(2);
k = unidrnd(nC - 1) + 1;     % Select crossover point
C = [P1.C(p1,(1:k-1)) P1.C(p2,k:nC)];   % crossover


Listing 12.21: C-HeuristicsNutshell/M/./Ch12/Mutate.m


function C = Mutate(GA,C)
% Mutate.m -- version 2018-08-21
if rand < GA.pM
    % Child undergoes mutation
    j = unidrnd(GA.nC);      % Select position in chromosome
    if ~C(j), C(j) = 1; else C(j) = 0; end
end


Listing 12.22: C-HeuristicsNutshell/M/./Ch12/Survive.m 


function P = Survive(GA,P1,P2,OF,Data)
% Survive.m -- version 2018-08-23
nP = GA.nP; nP1 = GA.nP1; nP2 = GA.nP2;
P2.F = zeros(1,nP2);
for i = 1:nP2
    P2.F(i) = feval(OF,P2.C(i,:),Data);   % Children fitness
end
F = [P1.F,P2.F];
[ignore,I] = sort(F);
for i = 1:nP
    j = I(i);
    if j <= nP1
        P.C(i,:) = P1.C(j,:);             % Surviving parents
        P.F(i) = P1.F(j);
    else
        P.C(i,:) = P2.C(j-nP1,:);         % Surviving children
        P.F(i) = P2.F(j-nP1);
    end
end
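The three operators applied to a child (one-point crossover, single-bit mutation) and the selection of survivors can be summarized in a compact Python sketch. It assumes chromosomes are lists of booleans and that lower fitness is better; the function names are illustrative only.

```python
import random

def crossover(parent1, parent2, rng=random):
    """One-point crossover: head of parent1, tail of parent2 (length >= 2)."""
    k = rng.randrange(1, len(parent1))  # at least one gene from each parent
    return parent1[:k] + parent2[k:]

def mutate(child, p_mut, rng=random):
    """With probability p_mut, flip one randomly chosen gene."""
    child = list(child)
    if rng.random() < p_mut:
        j = rng.randrange(len(child))
        child[j] = not child[j]
    return child

def survive(parents, children, fitness, n_pop):
    """Keep the n_pop fittest (lowest fitness) among parents and children."""
    return sorted(parents + children, key=fitness)[:n_pop]

rng = random.Random(0)
child = crossover([True] * 6, [False] * 6, rng)
```

Sorting the pooled parents and children means the fittest individual always ends up first, which is what the main loop relies on when it reads the first element of the population.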


Solving the subset sum problem 


The script GASSP.m defines the names of the functions OF and SP, the length of the chromosome
nC and the number of restarts nRe. As this problem is solved very easily, we have reduced the
number of generations nG to 10. Notice that the starting population SP is computed with the same
function already used for the starting solutions in TASSP.m.


Listing 12.23: C-HeuristicsNutshell/M/./Ch12/GASSP.m 


SSPData; OF = 'SSP'; SP = 'SlogVec'; nC = n; nRe = 5; nG = 10;
goGA
GA, ResLogVec, plotRes


Running GASSP produces the following results:


>> GASSP
GA =
          OF: 'SSP'
          SP: 'SlogVec'
          nG: 10
          pM: 0.2000
          nC: 500
          nP: 200
         nP1: 100
         nP2: 200
    Restarts: 5
Elapsed time 0.4 sec

                    4 el.: 352 157 288 203
                    3 el.: 465  61 474
* Sol = 1000  60%   4 el.: 193 204 101 502
                    2 el.: 174 826
                    3 el.: 295 649  56

The Genetic Algorithm solves this problem very efficiently, which is not surprising given the
perfect correspondence between the logical array holding our variable and the chromosome. For this
execution, all restarts provided a subset satisfying the constraint to sum to s. Notice that GAH.m
stops as soon as a solution has been reached (line 18). For problems where the minimum value of the
objective function is not known, this instruction has to be commented out. Fig. 12.6 shows the
evolution of the fittest solution over the generations of the 3rd restart. A solution was already
found at generation 6, and the search was stopped.


FIGURE 12.6 Subset sum problem: Evolution of the fittest solution in restart 3. The algorithm stops at generation 6.


12.A.4 Differential Evolution 


The file dependencies of the code for the Differential Evolution (DE) algorithm are shown below.
The script DExxx.m specifies the names of the functions OF and SP and defines the number of
restarts nRe.


DExxx -> xxxData
DExxx -> goDE -> DEH -> SP, OF
DExxx -> ResRealVec
DExxx -> plotRes


The parameters of the Differential Evolution algorithm are preset in goDE and stored in the
structure DE with the fields listed below. The parameters can be overwritten by the user in DExxx.m.


nP              Size of population
nG              Number of generations
CR              Parameter
F               Parameter
oneElemfromPv   Parameter


Listing 12.24: C-HeuristicsNutshell/M/./Ch12/goDE.m 


% goDE.m -- version 2018-10-30
if ~exist('nG','var'), nG = 40; end
if ~exist('CR','var'), CR = .8; end
if ~exist('F','var'), F = .8; end
if ~exist('nP','var'), nP = 15; end
if ~exist('oneElemfromPv','var'), oneElemfromPv = 1; end
DE = struct('OF',OF,'SP',SP,'nG',nG,'CR',CR,'F',F,'nP',...
    nP,'oneElemfromPv',oneElemfromPv,'Restarts',nRe);
Sol = NaN(nRe,1); X = zeros(Data.n,nRe);
FG = zeros(nG,nRe); JB = NaN(1,nRe); t0 = tic;
for r = 1:DE.Restarts
    [Sol(r),X(:,r),FG(:,r),JB(r)] = DEH(DE,Data);
end, t1 = toc(t0);


The function DEH.m implements the Differential Evolution algorithm. 


Listing 12.25: C-HeuristicsNutshell/M/./Ch12/DEH.m 


function [Fbest,xbest,FG,kbest] = DEH(DE,Data)
% DEH.m -- version 2018-10-31
n = Data.n; nP = DE.nP; nG = DE.nG;
OF = DE.OF; SP = DE.SP; Col = 1:nP; FG = NaN(nG,1);
[F,P1,Fbest] = feval(SP,n,nP,OF,Data);   % --Starting pop. (nP members)
for k = 1:nG
    P0 = P1;
    Io = randperm(nP)'; Ic = randperm(4)';
    I  = circshift(Io,Ic(1)); R1 = circshift(Io,Ic(2));
    R2 = circshift(Io,Ic(3)); R3 = circshift(Io,Ic(4));
    % -- Construct mutant array
    Pv = P0(:,R1) + DE.F * (P0(:,R2) - P0(:,R3));
    % -- Crossover
    mPv = rand(n,nP) < DE.CR;
    if DE.oneElemfromPv
        Row = unidrnd(n,1,nP);
        mPv1 = sparse(Row,Col,1,n,nP);
        mPv = mPv | mPv1;
    end
    mP0 = ~mPv;
    Pu(:,I) = P0(:,I).*mP0 + mPv.*Pv;
    % -- Select array to enter new generation
    flag = 0;
    for i = 1:nP
        Ftemp = feval(OF,Pu(:,i),Data);
        if Ftemp <= F(i)
            P1(:,i) = Pu(:,i);
            F(i) = Ftemp;
            if Ftemp < Fbest
                Fbest = Ftemp; xbest = Pu(:,i); flag = 1;
            end
        else
            P1(:,i) = P0(:,i);
        end
    end
    if flag, FG(k) = Fbest; kbest = k; end
end
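For readers who prefer a loop-based view, one generation of Differential Evolution can be sketched in plain Python. This is a simplified illustration, not a translation of DEH.m: the three partner indices are drawn by sampling without replacement instead of the circular shifts of a permutation used above, and the names `de_generation` and `sphere` are hypothetical.

```python
import random

def de_generation(pop, values, objective, F=0.8, CR=0.8, rng=random):
    """One DE generation: mutation, crossover and one-to-one selection."""
    n_pop, n = len(pop), len(pop[0])
    new_pop, new_values = [], []
    for i in range(n_pop):
        # three distinct partners, all different from i
        r1, r2, r3 = rng.sample([k for k in range(n_pop) if k != i], 3)
        mutant = [pop[r1][d] + F * (pop[r2][d] - pop[r3][d]) for d in range(n)]
        forced = rng.randrange(n)       # at least one component comes from the mutant
        trial = [mutant[d] if (rng.random() < CR or d == forced) else pop[i][d]
                 for d in range(n)]
        f_trial = objective(trial)
        if f_trial <= values[i]:        # trial replaces the parent if not worse
            new_pop.append(trial)
            new_values.append(f_trial)
        else:
            new_pop.append(pop[i])
            new_values.append(values[i])
    return new_pop, new_values

def sphere(x):
    """Hypothetical test objective (sum of squares)."""
    return sum(v * v for v in x)

rng = random.Random(42)
pop = [[rng.uniform(-5, 5), rng.uniform(-5, 5)] for _ in range(15)]
vals = [sphere(x) for x in pop]
new_pop, new_vals = de_generation(pop, vals, sphere, rng=rng)
```

Because each trial vector competes only with its own parent, no member of the population can get worse from one generation to the next.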


Minimization of the Shekel function 


Listing 12.26: C-HeuristicsNutshell/M/./Ch12/DEShekel.m 


ShekelData, OF = 'Shekel'; SP = 'SrealVec'; nRe = 5;
goDE
DE, ResRealVec, plotRes


Running DEShekel produces the following results:


>> DEShekel
DE =
               OF: 'Shekel'
               SP: 'SrealVec'
               nG: 40
               CR: 0.8000
                F: 0.8000
               nP: 15
    oneElemfromPv: 1
         Restarts: 5
Elapsed time 0.1 sec

  Sol = -11.03  100%  x = [ 4.00 4.00]
  Sol = -11.03   98%  x = [ 4.00 4.00]
  Sol = -11.03  100%  x = [ 4.00 4.00]
  Sol = -11.03   93%  x = [ 4.01 4.00]
* Sol = -11.03  100%  x = [ 4.00 4.00]


Fig. 12.7 illustrates the working of the algorithm for the optimization of the Shekel function.
Differential Evolution appears to be very efficient for the problems solved.


FIGURE 12.7 Differential Evolution optimization. Left panel: Empirical distribution of solutions. Right panel: Best value
over generations of restart 3.


12.A.5 Particle Swarm Optimization


The file dependencies of the code for the Particle Swarm (PS) optimization are the following: 


PSxxx -> xxxData
PSxxx -> goPS -> PSH -> SP, OF
PSxxx -> ResRealVec
PSxxx -> plotRes


The parameters of the Particle Swarm Optimization algorithm are preset in goPS.m and stored in
the structure PS with the fields listed below. The parameters can be overwritten by the user in
PSxxx.m.


nP         Size of population
nG         Number of generations
c1         Parameter
c2         Parameter
cv         Parameter
vmax       Parameter
Restarts   Number of restarts


Listing 12.27: C-HeuristicsNutshell/M/./Ch12/goPS.m 


% goPS.m -- version 2018-11-14
if ~exist('nP','var'), nP = 15; end
if ~exist('nG','var'), nG = 40; end
if ~exist('c1','var'), c1 = 2; end
if ~exist('c2','var'), c2 = 2; end
if ~exist('cv','var'), cv = 1; end
if ~exist('vmax','var'), vmax = 1; end
PS = struct('OF',OF,'SP',SP,'nP',nP,'nG',nG,'c1',c1,...
    'c2',c2,'cv',cv,'vmax',vmax,'Restarts',nRe);
Sol = NaN(nRe,1); X = zeros(Data.n,nRe);
FG = zeros(nG,nRe); JB = NaN(1,nRe); t0 = tic;
for r = 1:nRe
    [Sol(r),X(:,r),FG(:,r),JB(r)] = PSH(PS,Data);
end, t1 = toc(t0);


The function PSH.m implements the Particle Swarm Optimization algorithm.


Listing 12.28: C-HeuristicsNutshell/M/./Ch12/PSH.m 


function [Gbest,xbest,FG,gbest] = PSH(PS,Data)
% PSH.m -- version 2018-11-01
OF = PS.OF; SP = PS.SP;
nP = PS.nP; nG = PS.nG; d = Data.n; FG = NaN(nG,1);
[F,P,Gbest,gbest] = feval(SP,d,nP,OF,Data);   %--Start. pop.
P = P'; Pbest = P; Fbest = F; v = PS.cv * rand(nP,d);
for k = 1:nG
    v = v + PS.c1 * rand(nP,d).*(Pbest - P) + ...
        PS.c2 * rand(nP,d).*(ones(nP,1)*Pbest(gbest,:) - P);
    v = min(v, PS.vmax);
    v = max(v,-PS.vmax);
    P = P + v;
    for i = 1:nP
        F(i) = feval(OF,P(i,:),Data);
    end
    I = find(F < Fbest);
    if ~isempty(I)
        Fbest(I) = F(I);
        Pbest(I,:) = P(I,:);
        [Fmin,ib] = min(F(I));
        if Fmin < Gbest
            Gbest = Fmin; gbest = I(ib); FG(k) = Gbest;
        end
    end
end
xbest = Pbest(gbest,:);
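The velocity and position update at the heart of the loop can be written element by element, as in the following Python sketch. It assumes positions and velocities are stored as lists of lists and updates them in place; `ps_step` is an illustrative name.

```python
import random

def ps_step(P, v, pbest, gbest_x, c1=2.0, c2=2.0, vmax=1.0, rng=random):
    """Update velocities (clipped to [-vmax, vmax]) and positions in place."""
    for i in range(len(P)):
        for d in range(len(P[i])):
            vel = (v[i][d]
                   + c1 * rng.random() * (pbest[i][d] - P[i][d])   # pull to own best
                   + c2 * rng.random() * (gbest_x[d] - P[i][d]))   # pull to swarm best
            v[i][d] = max(-vmax, min(vmax, vel))
            P[i][d] += v[i][d]
    return P, v

P = [[0.0, 0.0], [5.0, 5.0]]
old = [row[:] for row in P]
v = [[0.0, 0.0], [0.0, 0.0]]
pbest = [[1.0, 1.0], [4.0, 4.0]]
P, v = ps_step(P, v, pbest, [1.0, 1.0], rng=random.Random(7))
```

Clipping the velocity keeps particles from overshooting the search region when the two attraction terms point in the same direction.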


Minimization of the Shekel function 


Listing 12.29: C-HeuristicsNutshell/M/./Ch12/PSShekel.m 


ShekelData, OF = 'Shekel'; SP = 'SrealVec'; nRe = 5;
goPS
PS, ResRealVec, plotRes


Running PSShekel produces the following results:


>> PSShekel
PS =
          OF: 'Shekel'
          SP: 'SrealVec'
          nP: 15
          nG: 40
          c1: 2
          c2: 2
          cv: 1
        vmax: 1
    Restarts: 5
Elapsed time 0.0 sec

* Sol = -11.02  38%  x = [ 4.00 4.00]
  Sol = -10.96  13%  x = [ 4.00 3.98]
  Sol = -10.98  20%  x = [ 4.02 4.02]
  Sol = -10.94   8%  x = [ 3.98 4.02]
  Sol = -10.99    %  x = [ 4.02 3.99]


Fig. 12.8 illustrates the workings of the Particle Swarm algorithm for the optimization of the Shekel
function.


FIGURE 12.8 Particle Swarm optimization. Left panel: Empirical distribution of solutions. Right panel: Best value over
generations of restart 1.


Appendix 12.B Parallel computations in MATLAB 


If the performance of a program appears to be a hindrance, one may think about tuning it. Useful
guidelines about when and how to tune a program can be found in Altman (2015, Ch. 1). An
interesting feature for speeding up execution is parallelization. The Parallel Computing Toolbox
provides a collection of functions and language constructs allowing access to computing resources
(CPU nodes or GPUs) on the same machine.⁸ The restart loops in our heuristics are task-parallel,
so their parallel execution is easy to implement. First, some notation and theoretical aspects of
parallelization are presented. Next, a simple example illustrates the syntax for distributing parallel
tasks to a set of workers, and finally parallelization is applied to one of the codes introduced in
Section 12.A.


8. For an introduction to the MATLAB Distributed Computing Server, which enables parallelization across a cluster, grid or
cloud, see Altman (2015, Ch. 6.3) or MATLAB's documentation.

Notice that there exists a limit for the maximal theoretical speedup, known as Amdahl's
law (Amdahl, 1967). Consider a program where a portion of the code can be executed in parallel.
The time $T_1$ to execute the code on a single processor can be written as

$$T_1 = t_{\mathrm{ser}} + t_{\mathrm{par}} \qquad (12.6)$$

where $t_{\mathrm{ser}}$ is the time spent to execute the non-parallelizable code and $t_{\mathrm{par}}$ the time spent in
the portion of the code which is parallelizable. The execution time of this program with $p$ workers in
parallel is then

$$T_p = t_{\mathrm{ser}} + t_{\mathrm{par}}/p + t_{\mathrm{lat}} \qquad (12.7)$$

where $t_{\mathrm{lat}}$ is the latency and overhead to start and shut down the $p$ workers. Speedup and
corresponding efficiency are defined as

$$S_p = \frac{T_1}{T_p} \quad \text{and} \quad E_p = \frac{S_p}{p}.$$

Normalizing Eq. (12.6) we have

$$1 = \frac{t_{\mathrm{ser}}}{T_1} + \frac{t_{\mathrm{par}}}{T_1} = (1-\theta) + \theta$$

where $\theta$ represents the proportion of parallelizable code and $1-\theta$ the remaining proportion of
serial code. Neglecting $t_{\mathrm{lat}}$, speedup for execution with $p$ processors is then

$$S_p = \frac{(1-\theta)+\theta}{(1-\theta)+\theta/p} = \frac{1}{(1-\theta)+\theta/p}$$

and we observe that the limit of speedup for a growing number $p$ of workers is

$$\lim_{p\to\infty} S_p = \frac{1}{1-\theta}.$$

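The formulas above translate directly into a few lines of Python. In this sketch the function names are illustrative, and the optional argument `t_lat` is the latency expressed as a fraction of the single-processor time, which is an assumption made for the normalized form.

```python
def speedup(theta, p, t_lat=0.0):
    """Speedup T1/Tp with T1 normalized to 1; theta is the parallelizable share."""
    return 1.0 / ((1.0 - theta) + theta / p + t_lat)

def efficiency(theta, p, t_lat=0.0):
    """Efficiency: speedup per worker."""
    return speedup(theta, p, t_lat) / p

def max_speedup(theta):
    """Limit of the speedup for p going to infinity (latency neglected)."""
    return 1.0 / (1.0 - theta)

# Example: half of the code is parallelizable and two workers are available.
s2 = speedup(0.5, 2)        # 1 / (0.5 + 0.25)
e2 = efficiency(0.5, 2)
```

Even with an unlimited number of workers, a program that is half serial can never run more than twice as fast, which is the practical content of Amdahl's law.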
The code in DCex.m is divided into two parts. The first mimics a serial code by means of the
pause(tser) command, while the second part mimics code that suits parallel execution by
means of a pause(ptime) command inside a loop. The tic/toc commands measure execution
times.


Listing 12.30: C-HeuristicsNutshell/M/./Ch12/DCex.m 


tser = 8; ptime = 5; nRe = 10;   % values used in the text
t0 = tic;
pause(tser)                      % serial code
for k = 1:nRe                    % parallelizable code
    pause(ptime)
end
T1 = toc(t0);                    % Time for serial execution
% ------------------------------------------------
pool = parpool;                  % Create parallel pool
t0 = tic;
% -- Parallel execution --
pause(tser)                      % serial code
parfor k = 1:nRe                 % parfor replaces for
    pause(ptime)
end
Tp = toc(t0); p = pool.NumWorkers; delete(pool)
Sp = T1/Tp; Ep = Sp/p;


The command parpool creates a parallel pool of workers and returns an object describing the pool.⁹
parfor executes the body of each iteration independently and in an unspecified order. Thus, all
accessed data must be defined in the loop. After the computations, the command delete(pool)
releases the workers and closes the pool. Running the code with tser=8, ptime=5 and
nRe=10 gives the measured execution times T1 and Tp.


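In practice, $t_{\mathrm{ser}}$ and $t_{\mathrm{par}}$ are not explicit but can be retrieved from the system of equations
(12.6)-(12.7)

$$\begin{bmatrix} 1 & 1 \\ 1 & 1/p \end{bmatrix} \begin{bmatrix} t_{\mathrm{ser}} \\ t_{\mathrm{par}} \end{bmatrix} = \begin{bmatrix} T_1 \\ T_p - t_{\mathrm{lat}} \end{bmatrix}.$$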

Given the values of $T_1$ and $T_p$ measured for the executions of DCex.m and neglecting $t_{\mathrm{lat}}$, we
obtain $t_{\mathrm{ser}} = 9$ and $t_{\mathrm{par}} = 49$, which are close to the values fixed in the code (tser=8 and
ptime*nRe=50). The proportion of parallelizable code is then $\theta = t_{\mathrm{par}}/T_1 = 49/58 = 0.84$ and
the theoretical maximal speedup for this code is $S_\infty = 1/(1-\theta) = 6.25$. Fig. 12.9 shows the
theoretical speedup and efficiency as a function of the number of processors and the experimental
values for p = 2.


FIGURE 12.9 Code DCex.m. Theoretical speedup and efficiency as a function of the number of processors. Dots represent
computed values.
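Solving the two-equation system by hand gives $t_{\mathrm{par}} = (T_1 - (T_p - t_{\mathrm{lat}}))/(1 - 1/p)$ and $t_{\mathrm{ser}} = T_1 - t_{\mathrm{par}}$, which the following Python sketch implements. The function name is illustrative, and the value 33.5 used for the parallel time is a hypothetical measurement chosen to be consistent with the results quoted in the text; it is not reported in the source.

```python
def recover_times(T1, Tp, p, t_lat=0.0):
    """Recover (t_ser, t_par) from serial time T1 and parallel time Tp on p workers."""
    # Subtracting the second equation from the first eliminates t_ser.
    t_par = (T1 - (Tp - t_lat)) / (1.0 - 1.0 / p)
    t_ser = T1 - t_par
    return t_ser, t_par

# Hypothetical timings: T1 = 58 s on one processor, Tp = 33.5 s on p = 2 workers.
t_ser, t_par = recover_times(58.0, 33.5, 2)
theta = t_par / 58.0        # proportion of parallelizable code
```

The formula requires at least two workers, since for p = 1 the two equations coincide and the split between serial and parallel time cannot be identified.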


12.B.1 Parallel execution of restart loops 


It is straightforward to execute the restart loops of the previously presented algorithms in parallel. 
One just needs to start a parallel pool with the command pool=parpool and modify the syntax 
of the restart loop. As an example we consider the code for the Threshold Accepting algorithm and 
solve the subset sum problem. 

We change the syntax for the loop in goTA.m and rename this modified script to goTApar.m.


Listing 12.31: C-HeuristicsNutshell/M/./Ch12/goTApar.m 


parfor r = 1:nRe
    [Sol(r),X(:,r),FF(:,r),JB(:,r)] = TAH(TA,Data,...
        Fs(r),xs(:,r));
end, t1 = toc(t0);
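The pattern of replacing a serial restart loop by a parallel map has a close analogue in Python's standard library. The sketch below is only an illustration of that pattern: `one_restart` is a hypothetical toy search, and a thread pool is used to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor
import random

def one_restart(seed):
    """Hypothetical toy search: best of 1000 random draws for (x - 1)^2."""
    rng = random.Random(seed)           # own generator per restart: reproducible
    best = min((rng.uniform(-5, 5) - 1.0) ** 2 for _ in range(1000))
    return seed, best

def parallel_restarts(n_restarts, workers=2):
    """Distribute independent restarts over a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(one_restart, range(n_restarts)))

results = parallel_restarts(5)
```

As with parfor, the restarts must not depend on each other. For CPU-bound searches in CPython, `ProcessPoolExecutor` offers the same `map` interface and avoids the limits of threads.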


The script starting the execution of the subset sum problem with parallelized restart loops is then 
TASSPpar. 


Listing 12.32: C-HeuristicsNutshell/M/./Ch12/TASSPpar.m 


3| nRe = 5; nR = 6; nS = 400;
4| pool = parpool;
5| goTApar
6| TA, ResLogVec, plotRes
7| delete(pool)


9. For options, see MATLAB's help documentation.


>> TASSPpar 
Starting parallel pool (parpool) using the ‘local’ profile ... connected 
to 2 workers. 
TA = 
OF: ‘SSP’ 
SS: 'SlogVec0O’ 
NF: ‘DNeighbor’ 
Restarts: 5 
Rounds: 6 
Steps: 8000 
scale: 0.2000 
ptrim: 0.8000 
upctl: 0.9000 
Percentiles: [0.9000 0.7200 0.5400 0.3600 0.1800 0] 
th: [169.1000 136.4600 98 51 31.2200 0] 
Elapsed time 3.6 sec 


* Sol = 1000 56% 21 el.: 69 29 13 25 64 49 37 47 
1 32 36 83 27 10 48 24 
72 114 142 38 40 
* Sol = 1000 64% 22 el.: 29 84 25 43 37 47 1 9 
36 51 48 24 147 41 22 6 
50 2 26 38 129 105 
* Sol = 1000 40% 23 el.: 55 42 67 13 64 79 86 34 
47 32 18 27 44 10 51 22 
82 6 50 91 2 38 40 
* Sol = 1000 39% 18 el.: 136 29 21 63 164 43 49 119 
1 9 18 24 41 72 33 88 
52 38 
* Sol = 1000 86% 17 el.: 69 29 21 43 133 32 10 51 
100 108 22 50 111 88 2 26 
105 


Parallel pool using the ‘local’ profile is shutting down. 


Compared to the serial execution of the same problem in the previous section we obtain with 2 
workers a speedup of S2 = 5.4/3.6 = 1.5. 

In order to find subsets with smaller cardinality, as found previously when solving the problem 
with a Genetic Algorithm, the number of steps has been increased tenfold to 80,000. This comes 
with a corresponding tenfold increase in execution time but without finding subsets with smaller 
cardinality. At this point, one may show that investing in an improvement of the optimization model 
can be by far more efficient than investing in computing resources. We suggest a refinement of the 
neighborhood computation. 

The function DNeighbor.m chooses at random one element in w and either adds it to the solution
with probability 1/2 or removes it with the same probability. As there is no check whether or
not the element already belongs to the current solution, a high percentage of "neighbor"
solutions remain unmodified. To avoid this we consider the partition J ∪ J̄ of Ω, where J is the set
of elements in the current solution and J̄ the set of elements not in the solution. We then consider
with identical probability the following three situations: i) remove one element from J; ii) add one
element from J̄ to J; iii) exchange two elements between J and J̄. This is implemented in the
function DNeighbor2.m.


Listing 12.33: C-HeuristicsNutshell/M/./Ch12/DNeighbor2.m


function x1 = DNeighbor2(x0,ign1,Data,ign2)
% DNeighbor2.m -- version 2018-10-28
x1 = x0; n = Data.n;
J = find(x0); nonJ = find(x0==0); done = 0;
while ~done
    u = rand;
    if u < 1/3
        m = numel(J);
        if m > 1
            k = unidrnd(m,1);
            j = J(k); x1(j) = 0; done = 1;
        end
    elseif u < 2/3
        m = numel(nonJ);
        if m < n
            k = unidrnd(m,1);
            j = nonJ(k); x1(j) = 1; done = 1;
        end
    else
        m = numel(J); k = unidrnd(m,1);
        jout = J(k); x1(jout) = 0;
        m = numel(nonJ); k = unidrnd(m,1);
        jin = nonJ(k); x1(jin) = 1; done = 1;
    end
end
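The three moves (remove, add, exchange) can be sketched in Python as follows. This is an illustration under slightly different guards than the MATLAB listing: a move is only carried out when the set it draws from is non-empty, and the sketch assumes a solution vector with at least two elements. The name `d_neighbor2` is hypothetical.

```python
import random

def d_neighbor2(x0, rng=random):
    """Return a neighbor of x0 that always differs from x0 (len(x0) >= 2)."""
    x1 = list(x0)
    inside = [j for j, v in enumerate(x0) if v]        # J: elements in the solution
    outside = [j for j, v in enumerate(x0) if not v]   # complement of J
    while True:
        u = rng.random()
        if u < 1 / 3:
            if len(inside) > 1:         # i) remove, keeping at least one element
                x1[rng.choice(inside)] = False
                return x1
        elif u < 2 / 3:
            if outside:                 # ii) add an element not yet in the solution
                x1[rng.choice(outside)] = True
                return x1
        elif inside and outside:        # iii) exchange one element each way
            x1[rng.choice(inside)] = False
            x1[rng.choice(outside)] = True
            return x1

x0 = [True, False, True, False, False]
x1 = d_neighbor2(x0, random.Random(5))
```

Because every accepted move changes one or two positions, no step of the search is wasted on a neighbor identical to the current solution.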


Executing TASSPpar with NF set to DNeighbor2 appears to perform particularly well, as we
obtain sets of cardinality 2 with a number of steps nS reduced by a factor of 200. With respect
to the execution with nS = 80,000 we are about 40 times faster for solutions verifying global
minimum cardinality. Parallel execution no longer seems to be needed in this case, since the overhead
for opening and closing the parallel pool is already much larger than the execution time.


>> TASSPpar 
Starting parallel pool (parpool) using the ‘local’ profile ... connected 
to 2 workers. 
TA = 
OF: ‘SSP’ 
SS: 'SlogVec’ 
NF: ‘DNeighbor2’ 
Restarts: 5 
Rounds: 6 
Steps: 400 
scale: 0.2000 
ptrim: 0.8000 
upctl: 0.9000 
Percentiles: [0.9000 0.7200 0.5400 0.3600 0.1800 0] 
the. FELII 9T 672400751 25.0] 
Elapsed time 0.8 sec 


* Sol = 1000 92% 3 el.: 3 491 506
* Sol = 1000 24% 3 el.: 30 153 817
* Sol = 1000 41% 2 el.: 722 278
* Sol = 1000 25% 5 el.: 72 172 73 632 51
* Sol = 1000 20% 4 el.: 321 3 641 35


Parallel pool using the ‘local’ profile is shutting down. 


Going off-road with the computations to a cluster, cloud or grid is an interesting alternative for
solving large problems. In such cases we have to use batch mode. To submit TASSPpar.m for
batch execution, line 6, which produces the output, has to be replaced by a save Resfile.mat
command, which saves all variables in the workspace to the file Resfile.mat. After copying this
file to the local machine, the output is obtained by executing the following two lines.


>> load Resfile.mat 
>> TA, ResLogVec, plotRes 


Appendix 12.C Heuristic methods in the NMOF package 


As of version 1.5, the NMOF package comes with implementations of Local Search, Threshold
Accepting, Simulated Annealing, Differential Evolution, Particle Swarm Optimization, and Genetic
Algorithms. This appendix briefly showcases the functions and their use.


> library("NMOF")                                              [attach-pkg]
> set.seed(2345567)


12.C.1 Local Search 


Local Search is implemented in function LSopt. 


> ## Aim: find the columns of X that, when summed, give y      [LSopt-example]

> ## random data set
> nc <- 25L          ## number of columns in data set
> nr <- 5L           ## number of rows in data set
> howManyCols <- 5L  ## length of true solution
> X <- array(runif(nr*nc), dim = c(nr, nc))
> xTRUE <- logical(nc)
> xTRUE[sample(1L:nc, howManyCols)] <- TRUE
> Xt <- X[, xTRUE, drop = FALSE]
> y <- rowSums(Xt)
> Data <- list(X = X, y = y, nc = nc, nr = nr, n = 1L)

> ## a random solution x0 ...
> makeRandomSol <- function(nc) {
      ii <- sample.int(nc, sample.int(nc, 1L))
      x0 <- logical(nc)
      x0[ii] <- TRUE
      x0
  }
> x0 <- makeRandomSol(nc)

> ## ... but probably not a good one
> abs(sum(y - rowSums(X[, xTRUE, drop = FALSE])))  ## should be 0

[1] 0

> abs(sum(y - rowSums(X[, x0, drop = FALSE])))

[1] 13.4

> ## a neighborhood function: switch n elements in solution
> neighbour <- function(xc, Data) {
      xn <- xc
      p <- sample.int(Data$nc, Data$n)
      xn[p] <- !xn[p]
      if (sum(xn) < 1L)
          xn <- xc
      xn
  }

> ## an objective function
> OF <- function(xn, Data)
      abs(sum(Data$y - rowSums(Data$X[, xn, drop = FALSE])))

> ## LOCAL SEARCH
> ls.settings <- list(nI = 5000L,
                      neighbour = neighbour,
                      x0 = x0,
                      printBar = FALSE,
                      printDetail = FALSE)
> solLS <- LSopt(OF, algo = ls.settings, Data = Data)
> solLS$OFvalue ## the true solution has OF-value 0 


[1] 0.798 


12.C.2 Simulated Annealing 


Simulated Annealing is implemented in function SAopt. We can reuse the functions we defined
for LSopt.

Note that we set nI in the settings of LSopt: it fixes the number of iterations, which makes
comparing runs of LSopt, SAopt and TAopt (below) easier.


> solSA <- SAopt(OF, algo = ls.settings, Data = Data)          [SAopt-example]
> solSA$OFvalue  ## the true solution has OF-value 0


[1] 0.00403 


12.C.3 Threshold Accepting 


Threshold Accepting is implemented in function TAopt.


> solTA <- TAopt(OF, algo = ls.settings, Data = Data)          [TAopt-example]
> solTA$OFvalue  ## the true solution has OF-value 0


[1] 0.00131


Let us increase the number of iterations.


> ls.settings$nI <- 100000                                     [TAopt-example2]
> solTA <- TAopt(OF, algo = ls.settings, Data = Data)
> solTA$OFvalue  ## the true solution has OF-value 0


[1] 4.69e-06 


12.C.4 Genetic Algorithm 


Optimization via a Genetic Algorithm is implemented in function GAopt. 
We stay with the problem defined before. 


> ga.settings <- list(nP = 500,                                [GAopt-example]
                      nG = 100L,
                      nB = Data$nc,
                      printBar = FALSE)
> solGA <- GAopt(OF, algo = ga.settings, Data = Data)


Genetic Algorithm.
Best solution has objective function value 0.0014 ;
standard deviation of OF in final population is 0


> solGA$OFvalue


[1] 0.0014 


12.C.5 Differential Evolution 


Differential Evolution is implemented in function DEopt.


> ## Example: Trefethen's 100-digit challenge (problem 4)      [DEopt-example]
> ## http://people.maths.ox.ac.uk/trefethen/hundred.html


> de.settings <- list(nP = 50L,    ## population size
                      nG = 1000L,  ## number of generations
                      F = 0.5,     ## step size
GR = TONOS ## prob of crossover 


                      min = c(-10, -10), ## range of initial
                      max = c( 10,  10), ## population
                      printBar = FALSE)
> solDE <- DEopt(OF = tfTrefethen,       ## see ?testFunctions
                 algo = de.settings)


Differential Evolution.
Best solution has objective function value -3.31 ;
standard deviation of OF in final population is 1.68e-16


> ## correct answer:
> # -3.30686864747523
> noquote(format(solDE$OFvalue, digits = 12))


[1] -3.30686864748


> ## check convergence of population
> sd(solDE$popF)


[1] 1.68e-16 


12.C.6 Particle Swarm Optimization 


Particle Swarm Optimization is implemented in function PSopt.


> ps.settings <- list(nP = 50L,    ## population size          [PSopt-example]
                      nG = 1000L,  ## number of generations
                      min = c(-10, -10), ## range of initial
                      max = c( 10,  10), ## population
                      printBar = FALSE)


> solPS <- PSopt(OF = tfTrefethen,       ## see ?testFunctions
                 algo = ps.settings)


Particle Swarm Optimisation.
Best solution has objective function value -3.31 ;
standard deviation of OF in final population is 1.37


> ## correct answer:
> # -3.30686864747523
> noquote(format(solPS$OFvalue, digits = 12))


[1] -3.30686864748 


12.C.7 Restarts 


Finally, we should mention the function restartOpt, which works for all the optimization
functions in NMOF. Let us use it for the first example.


> sols.ls <- restartOpt(LSopt, n = 100L,                       [restartOpt-example]
                        OF = OF,
                        algo = ls.settings,
                        Data = Data)


> sols.sa <- restartOpt(SAopt, n = 100L,
                        OF = OF,
                        algo = ls.settings,
                        Data = Data)


> sols.ta <- restartOpt(TAopt, n = 100L,
                        OF = OF,
                        algo = ls.settings,
                        Data = Data)

We extract the objective function values of the restarts.


> summary(sapply(sols.ls, `[[`, "OFvalue"))                    [restarts]


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.003   0.230   0.352   0.393   0.545   0.920


> summary(sapply(sols.sa, `[[`, "OFvalue"))


    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
0.000000 0.000005 0.000107 0.000118 0.000151 0.000536


> summary(sapply(sols.ta, `[[`, "OFvalue"))


    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
0.000000 0.000000 0.000005 0.000051 0.000107 0.000475


Note that both Simulated Annealing and Threshold Accepting manage to reach the optimum zero 
value. 


Chapter 13 


Heuristics: a tutorial 


Contents

13.1 On Optimization
    13.1.1 Models
    13.1.2 Methods
13.2 The problem: choosing few from many
    13.2.1 The subset-sum problem
    13.2.2 Representing a solution
    13.2.3 Evaluating a solution
    13.2.4 Knowing the solution
13.3 Solution strategies
    13.3.1 Being thorough
    13.3.2 Being constructive
    13.3.3 Being random
    13.3.4 Getting better
13.4 Heuristics
    13.4.1 On heuristics
    13.4.2 Local Search
    13.4.3 Threshold Accepting
    13.4.4 Settings, or: how (long) to run an algorithm
    13.4.5 Stochastics of LS and TA
13.5 Application: selecting variables in a regression
    13.5.1 Linear models
    13.5.2 Fast least squares
    13.5.3 Selection criterion
    13.5.4 Putting it all together
13.6 Application: portfolio selection
    13.6.1 Models
    13.6.2 Local-Search algorithms


In this chapter, we are going to tackle the so-called subset-sum problem. The chapter will be very
hands-on: if you wish to use heuristics, i.e. you want to implement them or use implementations 
to solve your models, this chapter will provide you with all the information you need. (Otherwise, 
you may want to read Chapter 12.) We start with a brief discussion of optimization in general. 


13.1 On Optimization 
13.1.1 Models 


Optimization, as the term is used in this chapter and in this book, means solving a model. A model
is a precise description of the actual problem, translated into software form and fed to a computer.
The computer, in turn, returns a solution to the model. This chapter will be concerned only with
solving models, not with how to come up with them. See the discussion in Chapter 10.

Mathematically, an optimization model is written as


    minimize f(x).    (13.1)


In this formulation, x stands for the decision variables, i.e. the quantities that we can change. The 
function f is the so-called objective function: it maps our choice for x into a real number; the 
lower this number, the better is x. The solution of the model is the input x* that makes f as small 
as possible. It is customary to write models as minimization models; to maximize, just minimize
−f. In most models, there are restrictions on how we may choose x.
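The sign flip can be illustrated with a minimal sketch in base R. The function g and the search interval below are made up for illustration; they are not part of the chapter's example.

```r
## hypothetical function that we wish to MAXIMIZE (an assumption for this sketch)
g <- function(x) -(x - 1)^2 + 3

## objective function for a minimizer: simply the negative of g
f <- function(x) -g(x)

sol <- optimize(f, interval = c(-5, 5))
xStar <- sol$minimum      ## the x that maximizes g
gStar <- -sol$objective   ## the maximum of g
```

The minimizer never needs to know that we were interested in a maximum; only the objective function changes.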

Writing the model in this way evokes the view of f as some mathematical function. We find it 
helpful to not think in terms of a mathematical description, but rather to replace f by something 
like 

    solutionQuality = function(x, data).

That is, we need to be able to program a mapping from a solution to its quality, given the data.
There is no need for a closed-form mathematical description of the function.¹ Indeed, in many
applied disciplines there are no closed-form objective functions. The function f could include
an experimental setup, with x the chosen treatment and f(x) the desirability of its outcome. Or
evaluating f might require a complicated stochastic simulation, such as an agent-based model.


1. Mathematically a function is nothing but a mapping, so there is no contradiction here. But when people see f(x) they
intuitively often think of something like f(x) = √x + x². We would prefer they thought of a program, not a formula. In fact,

Following this view, a solution does not necessarily mean that we really solved the model in 
the mathematical sense (and even less that we actually solved the original problem). The solution 
is simply the result that the computer returns. 


13.1.2 Methods 


There are many different ways and techniques to solve a model. It is important to keep model 
and solver separate. Historically, this has often not been done. Instead, models were shaped to fit 
the existing solvers. That was unfortunate: often, particular models have been bound up with their 
solution techniques, which obscured the models. 

As we said, there are many different ways to solve a model. But very roughly, we may dif- 
ferentiate between two approaches: constructive ones, and iterative ones. In constructive ones, we 
construct a single solution, typically by exploiting our knowledge about the model and the data. 
Iterative approaches take an initial solution and then change it (that is, improve it), often many times,
until some stopping criterion is met.

It will be easier to discuss all this with a concrete example. 


13.2 The problem: choosing few from many 
13.2.1 The subset-sum problem 


The subset-sum problem is conceptually simple: we are given a list of numbers, and our job is to 
find a subset of these numbers such that the sum of the numbers in the subset equals a specified 
target. We may not be able to find a subset whose sum exactly matches the target, so we settle for 
a sum close to the target. As close as possible, to be precise. 

You may doubt whether being able to solve such a problem would be useful in financial appli- 
cations. But at its core, the problem is about choosing a subset of quantities out of a much larger 
universe. And this kind of problem is ubiquitous: think of selecting assets to put into a portfo- 
lio, variables to put into a regression model, or time-series features to put into a machine-learning 
algorithm. We will see later that we can reuse the code for any such selection problem. 

Let us introduce some notation: we call X the list of numbers; we want to find a subset x ⊂ X
such that ∑x is close to a target sum s₀. There are n numbers in X. Later, when we show and
describe code, we shall refer to X, x, s0 and n.

Let us use concrete numbers: we set n to 100, and s₀ to 2.²


> n <- 100L
> X <- runif(n)
> s0 <- 2


The L-suffix in 100L indicates that n is an integer. The elements of X are uniformly-distributed 
random numbers. 
Many of the computations that we run later will be stochastic, so we start by setting a seed. 


> set.seed(298359)


Before we go on, we need to make some preparations: we need to think how to represent a 
solution x, and we need a method to judge the quality of such a solution. 


1. (continued) if the word function reminds you first of a function as the term is used in programming languages such as MATLAB®, R
or Python, then you are on the right track.

2. These were the numbers used in a discussion on the R-help list, which inspired this tutorial. See
https://stat.ethz.ch/pipermail/r-help/2010-January/226267.html.


13.2.2 Representing a solution


Any element of X is either in one particular subset or it is not: thus, a natural representation of a 
solution is a logical vector of length n: 


TRUE FALSE FALSE FALSE FALSE 
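To illustrate this representation, here is a small sketch with made-up numbers. The five-element vector X below is a toy example, not the chapter's data.

```r
X <- c(0.9, 0.3, 0.5, 0.1, 0.7)            ## toy data (an assumption for this sketch)
x <- c(TRUE, FALSE, FALSE, TRUE, FALSE)    ## a solution: picks elements 1 and 4

X[x]        ## the chosen numbers
which(x)    ## their positions
sum(X[x])   ## the subset sum
```

A logical vector can be used directly as an index, so the subset sum requires no loop.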


Let us create a random solution: 


> x <- runif(n) > 0.5
> summary(x)


Mode FALSE TRUE 
logical 58 42 


13.2.3 Evaluating a solution 


Now that we have a solution, let us compute how good it is: we need an objective function. This 
function takes a solution and maps it into a real number. The smaller this number is, the better. 
(Recall that if we wished to maximize a quantity, we would simply multiply the objective function
value by −1.) To see how good x is, we compute its associated sum and see how far it is from s0.


> abs(sum(X[x]) - s0)


Such a computation belongs in a function: the objective function. It takes a solution and the
necessary data and returns the quality.


> OF <- function(x, X, s0)
      abs(sum(X[x]) - s0)


> OF(x, X, s0)


So our random solution is not very good. (Note that picking any single element from X would have 
given a better solution.) 

If all elements of x are TRUE, the objective function sums the complete X. What if all elements 
in x are FALSE? Luckily, R has a sensible convention for the sum of a zero-length vector: it is zero. 


> sum(numeric(0L))


[1] 0 


Thus, an empty solution (one that picks no element of X) will have an objective function value of
exactly 2; or, more generally, a value equal to the target s0.


> x <- logical(n)
> OF(x, X, s0)


[1] 2


In the current version of OF, we pass all data as separate arguments. Let us write an alternative
version of the objective function, which expects all data collected in a single list. In this way, if we
later change the code (in particular, if we add more parameters), we need not add new arguments to
the function. We collect all objects to be passed to the function in the list Data.


> Data <- list(X = runif(100), n = 100L, s0 = 2)


Note in general that we follow a good rule in programming and have no magic numbers in our
code. A magic number is any number other than 0 or 1. Whenever you see such a number written
in the code, it is a sign to store it in a variable.³

The new function is slightly more cumbersome, since we need to write Data$n instead of n, 
and so on; but you will see it is worth it. 


> OF <- function(x, Data)
      abs(sum(Data$X[x]) - Data$s0)


The all-FALSE solution again.


> OF(x, Data)


[1] 2 


13.2.4 Knowing the solution 


When we later choose subsets from X, it would be great if we could compare them with the truly- 
best solution—the global optimum. But we do not know this optimum. 

There is a way to handle this: we create a problem instance whose solution we know, i.e. we 
will set up data X such that i) an optimal subset exists whose sum is exactly 2; and ii) we know the 
composition of this subset. We call its solution xTrue. In this way, we can easily check how good 
the solutions are that we obtain from different solution strategies. 

It is straightforward to do this: we randomly select a subset of X and then scale its elements so 
that their sum equals s0. We first select the elements to be put into the true subset. 


> ## sample elements included in xTrue
> true <- sort(sample(seq_len(Data$n),
                      sample(2:Data$n, 1)))  ## (assume n > 2)
> true
 [1]   6   8  10  14  23  28  29  30  44  47  49  51
[13]  53  55  58  62  69  70  72  80  81  90  93  95
[25]  96  97 100


Now we create xTrue. 


> xTrue <- logical(Data$n)
> xTrue[true] <- TRUE
> ## scale sum of xTrue to be exactly 2
> Data$X[xTrue] <- Data$X[xTrue]/sum(Data$X[xTrue]) *
                   Data$s0


Let us check. 


> sort(which(xTrue)) ## should be the same as 'true'
 [1]   6   8  10  14  23  28  29  30  44  47  49  51
[13]  53  55  58  62  69  70  72  80  81  90  93  95
[25]  96  97 100
> sum(Data$X[xTrue]) ## should be 2
[1] 2
> OF(xTrue, Data) ## should be 0
[1] 0


3. In calendar applications, the numbers 12 and 7 may be admissible, too.


Thus, the optimal solution will have an objective function value of zero. (See Chapter 2 for why it may
not be exactly zero.) The aim of all the strategies we discuss next is to find a subset whose objective
function value is zero.
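Because the rescaling is done in floating-point arithmetic, the sum of the rescaled elements may differ from the target by a rounding error. A minimal sketch of such a check follows; the seed, the data, and the subset sel are assumptions made only for this illustration.

```r
set.seed(42)                 ## arbitrary seed for this sketch
X   <- runif(10)             ## toy data
s0  <- 2
sel <- c(2L, 5L, 7L)         ## hypothetical 'true' subset

## rescale the chosen elements so that they sum to s0
X[sel] <- X[sel] / sum(X[sel]) * s0

err <- abs(sum(X[sel]) - s0)          ## zero, or a tiny rounding error
isTRUE(all.equal(sum(X[sel]), s0))    ## comparison with a numerical tolerance
```

When comparing objective function values with zero, it is therefore safer to use a small tolerance than to test for exact equality.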


13.3 Solution strategies 
13.3.1 Being thorough 


Here is one strategy: write down all possible subsets of X, compute the sum of each, and keep the 
subset whose sum is closest to 2. 

The problem with this approach is the number of combinations. Suppose you wanted a list of 
subsets of size 30, say. You may think that you could use choose to compute the number of 
possibilities. And it turns out you can. 


> noquote(sprintf("%-30.0f", choose(Data$n, 30)))


[1] 29372339821610843812921344 


But this number is not exact, of course: the number of possibilities is too large to fit into an integer,
and too large to be represented exactly in double precision. Fortunately, the R package Rmpfr can help.
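A quick check in base R shows why the result of choose cannot be exact here. This is a sketch that relies only on the standard fact that doubles represent every integer exactly only up to 2^53.

```r
nc <- choose(100, 30)      ## a double, not an integer

.Machine$integer.max       ## largest R integer: 2147483647
nc > 2^53                  ## TRUE: beyond the exactly representable integers
log2(nc)                   ## number of bits needed, between 84 and 85
```

The trailing digits printed by sprintf are thus artifacts of the floating-point representation.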


> library("Rmpfr")
> chooseMpfr(100, 30)

1 'mpfr' number of precision 115 bits
[1] 29372339821610944823963760

The package even has a handy function for computing all possibilities.

> chooseMpfr.all(n = Data$n)

Let us look at the number of combinations for several cases. 

> ii <- c(1, seq(10, 100, by = 5))
> ii

 [1]   1  10  15  20  25  30  35  40  45  50  55  60
[13]  65  70  75  80  85  90  95 100

> chooseMpfr.all(n = 100)[ii]

20 'mpfr' numbers of precision 97 bits
                            100 ##   1 out of 100
                 17310309456440 ##  10 out of 100
             253338471349988640 ##  15 out of 100
          535983370403809682970 ##  20 out of 100
       242519269720337121015504 ##  25 out of 100
     29372339821610944823963760 ##  30 out of 100
   1095067153187962886461165020 ##  35 out of 100
  13746234145802811501267369720 ##  40 out of 100
  61448471214136179596720592960 ##  45 out of 100
 100891344545564193334812497256 ##  50 out of 100
  61448471214136179596720592960 ##  55 out of 100
  13746234145802811501267369720 ##  60 out of 100
   1095067153187962886461165020 ##  65 out of 100
     29372339821610944823963760 ##  70 out of 100
       242519269720337121015504 ##  75 out of 100
          535983370403809682970 ##  80 out of 100
             253338471349988640 ##  85 out of 100
                 17310309456440 ##  90 out of 100
                       75287520 ##  95 out of 100
                              1 ## 100 out of 100


Altogether there exists an incredibly large number of possibilities: 


> sum(chooseMpfr.all(n = 100) ) 


1 ‘mpfr’ number of precision 97 bits 
[1] 1267650600228229401496703205376 


If we could process a billion subsets per second, it would still take 4 × 10¹³ years to arrive at the
answer. The sheer size of this number should also make clear that faster hardware or running things 
in parallel cannot help: speeding up the computation by a factor of 1000, say, would still leave us 
with more than 40 billion years. And that is only for a small problem size of 100. 
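The arithmetic behind these figures is easily reproduced. The sketch below assumes a year of 365.25 days; the variable names are chosen for this illustration.

```r
nSubsets   <- 2^100                   ## number of subsets of 100 elements
perSecond  <- 1e9                     ## assumed speed: one billion subsets per second
secPerYear <- 365.25 * 24 * 60 * 60   ## seconds in a year

years <- nSubsets / perSecond / secPerYear
years           ## roughly 4e13 years
years / 1000    ## with hardware 1000 times faster: roughly 4e10 years
```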

In sum, we can never expect to go through all subsets. So, we need to turn to computational 
strategies other than complete enumeration to come up with good solutions to the subset-sum prob- 
lem. Strategies that are more selective. 


13.3.2 Being constructive 


Actually, even without much optimization, we can think of ways to find a good solution. We know 
that the elements of X are small positive numbers. So here is one way to arrive at a solution: 
iteratively sum the elements of X; when you exceed s0, stop and keep this solution. Given our 
model and the data, we know that such a solution cannot be totally bad. 


> i <- which(cumsum(Data$X) > Data$s0)[1]
> xConstr <- logical(Data$n)
> xConstr[1:i] <- TRUE
> OF(xConstr, Data)


[1] 0.258


> xConstr[i] <- FALSE ## slightly below 2
> OF(xConstr, Data)


[1] 0.259


That is definitely better than an average random solution. We may be able to do even better by 
summing preferably small numbers; so we sort the numbers in X first. 


> ii <- order(Data$X)
> i <- which(cumsum(Data$X[ii]) > Data$s0)[1]
> xConstr <- logical(Data$n)
> xConstr[ii[1:i]] <- TRUE
> OF(xConstr, Data)


[1] 0.0885


> xConstr[ii[i]] <- FALSE ## slightly below 2
> OF(xConstr, Data)


[1] 0.0504


Such a solution strategy is called constructive since we build (we construct) one single solution.
Once we have this solution, we are done. Constructive methods may not be optimal, but they often
result in good solutions. At the very least, they can serve as useful benchmarks. For an application,
see Example 13.1. One downside is that this approach is not a general strategy: it only works for this
specific problem. Nevertheless, for many problems we can use knowledge about the problem to
come up with good solutions.


Example 13.1 The simplest way to compute the minimum variance portfolio 


The long-only minimum-variance portfolio is computed by estimating a variance-covariance matrix of
the returns of all assets, and then finding the portfolio weights that minimize the variance of portfolio
returns. Schumann (2013a) proposes a simpler, constructive method: sort the assets by their variances,
and then choose an equal-weight portfolio of the N assets with the lowest variance. In the paper, N
is set to 20. This sorting approach completely ignores the correlations between assets. In a simulation,
Schumann (2013a) shows that in-sample sorting is, unsurprisingly, worse than the standard approach
of using the full variance-covariance matrix. But the advantage of the standard approach fades out of
sample when there is diversity in the cross-section of assets (i.e. some assets have low, and some
assets have high variances), and we cannot precisely predict future covariance. Both of these assumptions
are empirically valid. Then, the sorting rule is rarely worse than the textbook approach (and if it is,
not by much), and sometimes even better. Gilli and Schumann (2017) provide empirical confirmation that
the sorting rule results in portfolios whose volatilities are comparable with portfolios that use the full
variance-covariance matrix.
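The sorting rule described in the example can be sketched in a few lines of R. The simulated returns below (uncorrelated, with made-up volatilities) and all variable names are assumptions for illustration; this is not the code used in the cited papers.

```r
set.seed(1)                                    ## arbitrary seed
nObs <- 500L; nAssets <- 50L
N <- 20L                                       ## number of assets to select, as in the example
vols <- runif(nAssets, min = 0.005, max = 0.03)   ## hypothetical daily volatilities
rets <- matrix(rnorm(nObs * nAssets), nrow = nObs) %*% diag(vols)

v <- apply(rets, 2, var)            ## variances only; correlations are ignored
chosen <- order(v)[seq_len(N)]      ## the N assets with the lowest variance
w <- numeric(nAssets)
w[chosen] <- 1 / N                  ## equal weights

sqrt(drop(t(w) %*% cov(rets) %*% w))   ## in-sample volatility of the portfolio
```

No optimization is required: a single sort constructs the portfolio.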


13.3.3 Being random 


Ken Thompson is said to have suggested that “When in doubt, use brute force.” The corresponding 
search strategy is to randomly choose subsets. Random sampling has a number of advantages: it 
is simple; it benefits from more computing power; and it can be distributed. But random sampling 
is also the least-efficient method we can think of. (At least the least-efficient method among the 
class of search strategies that really aim to find good solutions. Clearly, if we wanted to be bad on 
purpose, we could easily be less efficient.) See Example 13.2. 


Example 13.2 Random Sort (a.k.a. bogo-sort) 


In Chapter 1, we suggested that any model can be written (or thought of) as an optimization problem. 
Similarly, many other computations may be written as optimization models. Think of sorting a vector: 


> z <- sample(5)
> z

[1] 5 2 3 4 1


The function randomSort is an example implementation of a horribly-inefficient algorithm: randomly
permute the vector; if it is sorted, stop; otherwise permute again.


> randomSort <- function(x) {
      while (is.unsorted(x))
          x <- sample(x)
      x
  }
Let us see why it is a horribly-inefficient algorithm. 


> library("rbenchmark")
> benchmark(randomSort(3:1))[, 1:3]


             test replications elapsed
1 randomSort(3:1)          100   0.004


> benchmark(randomSort(5:1))[, 1:3]


             test replications elapsed
1 randomSort(5:1)          100   0.049


> benchmark(randomSort(7:1))[, 1:3]


             test replications elapsed
1 randomSort(7:1)          100    1.93


The time it takes to sort even very short vectors explodes. 
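The reason can be made explicit with a small calculation. With n distinct elements, a random permutation is sorted with probability 1/n!, so on average n! shuffles are needed. The sketch below is not from the text; it only evaluates this expectation.

```r
n <- c(3, 5, 7)
factorial(n)                   ## expected number of shuffles: 6, 120, 5040
factorial(7) / factorial(5)    ## 42: seven elements cost about 42 times more than five
```

The factor of 42 is broadly in line with the timings above, where the elapsed time grows from 0.049 to 1.93.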


The function randomSol randomly selects k elements of X, where k itself is chosen randomly.

> randomSol <- function(Data) {
      x <- logical(Data$n)
      k <- sample(Data$n, size = 1L)
      x[sample(Data$n, size = k)] <- TRUE
      x
  }
This is very fast, and very likely leads to bad solutions. 


> OF(randomSol(Data), Data)


> OF(randomSol(Data), Data)


[1] 18.4


> OF(randomSol(Data), Data)


[1] 20.3


FIGURE 13.1 The upper panel shows the left tail of the distribution of objective function values for a sample of random
solutions. We sample uniformly, and thus the distribution function resembles a straight line. The lower panel shows the
distribution for the best 100 of the one million random solutions.


Let us create one million random solutions instead. 


> trials <- 1e6
> OFvalues <- numeric(trials)
> solutions <- vector("list", length = trials)
> for (i in seq_len(trials)) {
      solutions[[i]] <- randomSol(Data)
      OFvalues[i] <- OF(solutions[[i]], Data)
  }


As Fig. 13.1 shows, some random solutions are very good, while others are, as expected,
quite bad. From these random solutions, we keep the 100 best ones, as a benchmark. We also store
the very best random solution as xRandom.


> best100 <- order(OFvalues)[1:100]
> random.OF <- OFvalues[best100] 


The objective function values of the best random solutions are plotted in Fig. 13.1. 
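The text mentions storing the very best random solution as xRandom. One way to extract it is sketched below; the stand-in data, the seed, and the much smaller number of trials are assumptions that keep the example self-contained and fast.

```r
set.seed(7)                                       ## arbitrary seed
Data <- list(X = runif(100), n = 100L, s0 = 2)    ## stand-in data
OF <- function(x, Data) abs(sum(Data$X[x]) - Data$s0)
randomSol <- function(Data) {
    x <- logical(Data$n)
    k <- sample(Data$n, size = 1L)
    x[sample(Data$n, size = k)] <- TRUE
    x
}

trials <- 1000L     ## far fewer than the one million trials in the text
solutions <- lapply(seq_len(trials), function(i) randomSol(Data))
OFvalues <- vapply(solutions, OF, numeric(1), Data = Data)

xRandom <- solutions[[which.min(OFvalues)]]   ## the very best random solution
OF(xRandom, Data)
```

Using which.min is equivalent to taking the first element of order(OFvalues).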


13.3.4 Getting better 


Now we turn to the strategy that is associated with optimization in a narrower sense: iterative improvement.
Its essence is to take an existing solution and change it in a way that improves the
solution; this improvement step is then repeated several (often many) times. The following pseudocode
makes the idea of an iterative method more precise.


    generate initial solution x^c
    while stopping condition not met do
        create new solution x^n = N(x^c)
        if A(x^n, f(x^n), f(x^c)) then
            x^c = x^n
        end if
    end while
    return x^c


In words: we start with a solution x^c, typically randomly chosen. Then, in each iteration, the function
N makes a copy of x^c and modifies this copy; thus, we get a new candidate solution x^n. The
N stands for neighbor, and this meaning will become clearer shortly. The function A (as in 'accept')
decides whether x^n replaces x^c. This function will typically look at and compare the quality
of the two solutions, which is why we have written it as a function of x^n, f(x^n), and f(x^c). Below,
when it simplifies the presentation, we may write out the definitions of these functions (i.e. "inline"
them). The process is repeated until a stopping condition is satisfied; finally, x^c is returned.
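The pseudocode translates almost line by line into R. The function iterate below is a hypothetical helper written for this sketch (it is not part of any package), and the toy problem that follows is made up.

```r
## generic iterative search; f, N and A are user-supplied functions
iterate <- function(f, N, A, x0, nSteps = 1000L) {
    xc <- x0
    for (s in seq_len(nSteps)) {      ## stopping condition: a fixed number of steps
        xn <- N(xc)
        if (A(xn, f(xn), f(xc)))
            xc <- xn
    }
    xc
}

## toy use: minimize (x - 3)^2 with random steps, accepting only improvements
set.seed(123)
f <- function(x) (x - 3)^2
N <- function(x) x + rnorm(1L, sd = 0.5)
A <- function(xn, fn, fc) fn < fc
x <- iterate(f, N, A, x0 = 0)
```

Because iterate touches a solution only through f, N and A, the same loop would work unchanged for a logical vector or any other data structure.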

To implement an iterative method, we need to specify

1. how we represent a solution x;
2. how we evaluate a solution (the function f);
3. how we change a solution (the function N);
4. how we decide whether to accept a solution (the function A);
5. when to stop.

Note that for the subset-sum problem, we are already done with 1. and 2. It remains to discuss
coming up with new solutions (3.); deciding whether to keep them (4.); and for how long to search
(5.).


Advanced Note When looking at the algorithm, two points deserve to be emphasized. 


• Solutions are handled through user-defined functions (f, N, A). If you like functional programming,
  you may see that they can be implemented without side-effects.

• Since all interaction with a solution x is done through these functions, a solution can be any type of
  object, i.e. any data structure (not only a numeric vector).


The algorithm presented and its building blocks would apply to a classical optimization method too.
For a gradient-based method, say, x would be a numeric vector. The function N would evaluate the
gradient at x^c and then move in the direction of the negative gradient with a specified stepsize; A would
compute the objective function values of x^c and x^n, and replace x^c only if x^n is better; if not, the search is
stopped.
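As a minimal sketch of this correspondence, the following code casts a plain gradient descent into the roles of N and A. The quadratic objective, its gradient, and the stepsize are assumptions chosen for illustration.

```r
f    <- function(x) sum((x - c(1, 2))^2)    ## toy objective (an assumption)
grad <- function(x) 2 * (x - c(1, 2))       ## its gradient
step <- 0.1                                 ## hypothetical stepsize

N <- function(xc) xc - step * grad(xc)      ## neighbor: move against the gradient
A <- function(xn, fn, fc) fn < fc           ## accept only if better

xc <- c(0, 0)
for (s in seq_len(1000L)) {                 ## upper limit on the number of iterations
    xn <- N(xc)
    if (!A(xn, f(xn), f(xc)))
        break                               ## no improvement: stop the search
    xc <- xn
}
xc
```

Here N is deterministic and A stops the search at the first non-improving step, which is exactly what distinguishes this method from the heuristics discussed later.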

We could also have written our random-sampling technique in this way: the function N could 
choose a new solution completely freely; and if the new solution is better than the old one, it 
replaces it. After a large number of trials, the procedure stops. But we said earlier that the N stands 
for neighborhood: the key idea is to keep a substantial part of the current solution, and not to replace 
it with something completely different (as a random search would do). Recall that a solution is a 
vector of logical values: 


TRUE FALSE FALSE FALSE FALSE 


Given such a solution, we may easily change it slightly and arrive at a new solution: 
TRUE FALSE [TRUE] FALSE FALSE 


Since we only make a small change to the solution, we call the modified solution a neighbor to the 
original one. 

There are different approaches to choose neighbors: we may choose a neighbor randomly; or 
we may be more systematic. In this section, we look at a systematic approach: greedy search. With 
such an approach, we check all neighbors and move to the best one. Heuristics, which we discuss 
in the next section, will be more random. 

For a greedy search, the function N now does a lot of work: it computes and evaluates all
neighbors; it only returns the best neighbor. For our example, we define a neighbor as a solution
that differs only in a single position from the current solution. A, on the other hand, is quite simple:
if the best neighbor is better than the current solution, the neighbor is accepted and replaces x^c. The
search stops when there is no further improvement.⁴ Note that this search may well depend on the
starting value.


    generate initial solution x^c
    f* = f(x^c)
    x* = x^c
    while stopping condition not met do
        x^c = x*
        for i = 1 to length(x^c) do
            create new solution x^n from x^c
            if f(x^n) < f* then
                x* = x^n
                f* = f(x^n)
            end if
        end for
    end while
    return x*


4. In a software implementation we will always introduce an upper limit to the number of iterations.


The function greedy implements such a search in R. It takes as inputs an objective function, 
an initial solution x0, and a limit for the number of iterations (maxit). 


> greedy <- function(fun, x0, ..., maxit = 1000L) {
      done <- FALSE
      xbest <- xc <- x0
      xbestF <- xcF <- fun(xbest, ...)
      ic <- 0

      while (!done) {
          if (ic > maxit)
              break
          else
              ic <- ic + 1L

          done <- TRUE
          xc <- xbest
          for (i in seq_along(xc)) {   ## all positions, i.e. seq_len(Data$n)
              xn <- xc
              xn[i] <- !xn[i]          ## neighbor: switch element i
              xnF <- fun(xn, ...)
              if (xnF < xbestF) {
                  xbest <- xn
                  xbestF <- xnF
                  done <- FALSE
              }
          }
      }
      list(xbest = xbest, OFvalue = xbestF, ic = ic)
  }

The function returns the best solution, the associated objective function value and the number 
of steps it has taken before it has stopped. Let us try it. 


> x0 <- randomSol(Data)
> result <- greedy(fun = OF, x0 = x0,
                   Data = Data, maxit = 1000L)
> trials <- 100
> greedy.ic <- greedy.OF <- numeric(trials)
> greedy.solutions <- vector("list", length = trials)


> for (i in seq_len(trials)) {
      g <- greedy(fun = OF,
                  x0 = randomSol(Data),
                  Data = Data,
                  maxit = 1000L)
      greedy.ic[i] <- g$ic
      greedy.OF[i] <- g$OFvalue
      greedy.solutions[[i]] <- g$xbest
  }


FIGURE 13.2 Distributions of objective function values of best 100 random solutions and 100 greedy solutions. Best
possible solution would be zero.


We extract the best greedy solution. 


> xGreedy <- greedy.solutions[[which.min(greedy.OF)]]
> OF(xGreedy, Data)


[1] 1.62e-07

It is also interesting to look at the number of iterations the algorithm has taken.


> summary (greedy.ic) 


Min. 1st Qu. Median Mean 3rd Qu. Max. 
2.0 10.0 26.5 29.4 47.0 68.0 


As you can see in Fig. 13.2, greedy search works quite well, too. Now that we have benchmarks,
let us move to heuristics. 


13.4 Heuristics 
13.4.1 On heuristics 


The word heuristics is used in many different disciplines: mathematics, psychology, judgment and
decision making, computer science, artificial intelligence and many more. Consistent with Wittgen- 
stein’s Sprachspielen, the word has different but related meanings; usually it is associated with 
optimization, rules of thumb, and search. One key similarity is that it is always about things that 
cannot be proved in a mathematical sense. 

In this chapter, and elsewhere in this book, we use the term heuristic in a narrow sense: as 
numerical optimization techniques that can solve models. 

For intuition-building: Suppose we have an optimization model and some solution x, perhaps 
randomly chosen. Above we described several rules to repeatedly change x. Actually, we mentioned 
quite extreme rules: on one hand, we had random search, which would not use any information 
about the problem, and also no information about the search (i.e. its own past). On the other hand, 
we had greedy search, which tried to improve in every step. And not just improve, but to improve in 
the best possible way. (An aside: gradient search as described above is the continuous counterpart
of greedy search. Moving in the direction of the negative gradient is, not without reason, also called
steepest descent.)

Heuristics are somewhere in between the extremes: they prescribe rules for changing x that are
i) simple and ii) on average, or in expectation, improve f(x). The key then is that such a rule is not
applied only once, but over and over again. That may be computationally intensive, but in the long
run the solution is improved. As a result, heuristics are far more efficient than random search, but
at the same time do not suffer from the primary weakness of greedy methods: heuristics can walk
away from local minima. We shall see that this is important.

Heuristics use other, often simpler, mechanisms than classical techniques. In fact, two characteristics
will show up in almost all methods. (i) Heuristics will not insist on the best possible moves.
A heuristic may accept a new solution x^n even if it is worse than the current solution. (ii) Heuristics
typically have random elements. For instance, a heuristic may change x^c randomly (instead
of locally-optimally as in a greedy or gradient search). These characteristics make heuristics inefficient
for well-behaved models. But for difficult models (for instance, those with many local optima),
they enable heuristics to move away from local optima.⁵ You will see examples of such algorithms
in the coming sections (and chapters).
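The first characteristic can be made concrete by comparing two acceptance rules. The sketch below contrasts a strict rule with a threshold rule in the spirit of Threshold Accepting, which is described later in this chapter; the function names, the threshold tau, and the numbers are hypothetical.

```r
acceptGreedy    <- function(fn, fc) fn < fc              ## only improvements
acceptThreshold <- function(fn, fc, tau) fn - fc < tau   ## tolerates small deteriorations

fc <- 1.00    ## objective function value of the current solution (made up)
fn <- 1.03    ## a slightly worse candidate

acceptGreedy(fn, fc)                    ## FALSE
acceptThreshold(fn, fc, tau = 0.05)     ## TRUE: worse, but not too bad
acceptThreshold(2, fc, tau = 0.05)      ## FALSE: far too bad
```

With a positive threshold, better solutions are always accepted, while worse ones are accepted only if the deterioration is small; this is what allows a search to leave a local minimum.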


13.4.2 Local Search 


Stochastic Local Search 


The term local search is, unfortunately, used both for a family of algorithms and for specific im- 
plementations. In this chapter, we will use the term Local Search (in title case) to mean a specific 
implementation, one that is sometimes also called Stochastic Local Search. We shall write LS as 
an abbreviation for this algorithm. When we talk about the family of techniques, we write local 
search in lower case; in fact, we will say local-search algorithms or something similar to make the 
meaning clear. LS is summarized in the following pseudocode. 


    generate initial solution x^c
    while stopping condition not met do
        create new solution x^n = N(x^c)
        if f(x^n) < f(x^c) then
            x^c = x^n
        end if
    end while
    return x^c


Note that we have not explicitly written the acceptance criterion as a function A because it is
so simple: an LS is still greedy in the sense that to be accepted a new solution must be better (or
at least not worse) than the current one. But LS adds a first ingredient that is vital to heuristics:
randomness. Indeed, when creating a new solution (a neighbor), an LS simply picks randomly
one element of the current solution and switches it. We write a small function to do that.


> neighbour <- function(x, Data) {
      p <- sample.int(n = Data$n,
                      size = Data$stepsize)
      x[p] <- !x[p]
      x
  }


Advanced note Why did we choose sample.int instead of sample? The answer is speed; but
before you make a habit of preferring one function over the other: the good habit is not to memorize
rules, but to measure.


> library("rbenchmark")
> benchmark(sample(100, 5),
            sample.int(100, 5),
            replications = 20000, order = "relative")[, 1:4]


                test replications elapsed relative
2 sample.int(100, 5)        20000   0.068     1.00
1     sample(100, 5)        20000   0.086     1.26


So sample is about 26% slower, for this case, on this platform.


5. In principle, because of such mechanisms a heuristic could drift farther and farther off a good solution. But practically,
that is very unlikely because every heuristic has a bias towards good solutions. In Threshold Accepting, the method that we
describe later in this chapter, that bias comes into effect because a better solution is always accepted, a worse one only if it
is not too bad. Since we repeat this creating of new candidate solutions thousands of times, we can be very certain that the
scenario of drifting off a good solution practically never occurs.


The neighbour function requires one parameter to be set, stepsize, which determines 
the number of elements to be changed at once. (You see how using a single list Data gives us 
flexibility in adding or removing arguments.) 


> Data$stepsize <- 1L
> summary(x)

   Mode   FALSE
logical     100

> summary(neighbour(x, Data))

   Mode   FALSE    TRUE
logical      99       1

> Data$stepsize <- 2L
> summary(neighbour(x, Data))

   Mode   FALSE    TRUE
logical      98       2
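The effect of stepsize can also be checked by counting the elements in which a solution and its neighbour differ. The following is a minimal, self-contained sketch; the toy list Data (with n = 100) and the all-FALSE starting solution are assumptions made for illustration only.

```r
## toy setup (assumed for illustration; not the chapter's data)
set.seed(1)
Data <- list(n = 100L, stepsize = 1L)

neighbour <- function(x, Data) {
    ## pick 'stepsize' distinct positions and flip them
    p <- sample.int(n = Data$n, size = Data$stepsize)
    x[p] <- !x[p]
    x
}

x <- logical(Data$n)    ## a solution with all elements FALSE

## number of changed elements for step sizes 1, 2 and 5
changed <- sapply(c(1L, 2L, 5L), function(k) {
    Data$stepsize <- k  ## local copy; the global Data is untouched
    sum(x != neighbour(x, Data))
})
changed
```

Because sample.int draws without replacement, the number of changed elements always equals the step size.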


As already pointed out, the acceptance rule is simple: if the new neighbor is better than the 
current solution, it becomes the current solution. If the neighbor is worse, it is discarded. Thus, 
we are already done with the inputs to LS. Actually, before we run the algorithm, it is worth pon- 
dering why such a strategy would work. First, note the asymmetry in the acceptance criterion: 
worse solutions are never accepted, but better ones are. Thus, we have a bias towards better solu- 
tions. 

But we also have a requirement about the model and data; namely that a solution close to a good 
solution is also good. In other words, the quality of neighboring solutions must be correlated. To 
illustrate this point, we take a random walk through the data and plot it in Fig. 13.3. 


> x <- randomSol(Data)
> randomWalk <- numeric(1000L)
> for (i in seq_along(randomWalk)) {
      x <- neighbour(x, Data)
      randomWalk[i] <- OF(x, Data)
  }


As you can see, the objective function value goes up and down (actually somewhat similar to a 
stock price), and adjacent values are similar. If you prefer numbers: 
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FIGURE 13.3 Objective function values of an unguided (random) walk through the search space. 
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FIGURE 13.4 Distributions of changes in objective function values when 1 (dark gray), 5 (gray), and 10 (light gray) 
elements are changed. 


> cor(randomWalk[-1L],
      randomWalk[-length(randomWalk)])


[1] 0.961 
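The correlation of adjacent values can be reproduced with a small stand-alone experiment. This is only a sketch: the definitions of Data, OF, randomSol and neighbour below are assumed toy versions of a subset-sum setup (uniform numbers, a target sum s0 = 2), not necessarily the ones used earlier in the chapter. For contrast, the same values are shuffled, which destroys the ordering produced by the walk.

```r
## toy subset-sum setup (assumptions for illustration)
set.seed(42)
Data <- list(X = runif(100), n = 100L, s0 = 2, stepsize = 1L)
OF <- function(x, Data) abs(sum(Data$X[x]) - Data$s0)
randomSol <- function(Data) runif(Data$n) > 0.5
neighbour <- function(x, Data) {
    p <- sample.int(Data$n, size = Data$stepsize)
    x[p] <- !x[p]
    x
}

## unguided random walk through the search space
x <- randomSol(Data)
rw <- numeric(1000L)
for (i in seq_along(rw)) {
    x <- neighbour(x, Data)
    rw[i] <- OF(x, Data)
}

## correlation between adjacent objective function values
lag1 <- cor(rw[-1L], rw[-length(rw)])

## the same values in random order: adjacency carries no information
shuffled <- sample(rw)
lag1_shuffled <- cor(shuffled[-1L], shuffled[-length(shuffled)])

c(walk = lag1, shuffled = lag1_shuffled)
```

The walk should show a correlation close to one, the shuffled series a correlation close to zero.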


The degree of correlation will depend on the step size. To show this, we create random solutions, 
and generate neighbors that differ by 1, 5 and 10 elements. Distribution functions of the resulting 
differences in objective function value are shown in Fig. 13.4. 

In general, larger steps in the neighborhood function should lead to larger changes in the ob- 
jective function. The aim in an LS (as well as in other heuristics) is to create meaningful variation 
across iterations, but at the same time keep solution quality correlated. Because of the bias towards 
better solutions, this will mean that over time the algorithm is going to improve the initial so- 
lution, because it will move the current solution towards regions of the search space with lower 
objective function values. (We should note here that what is meaningful in terms of variation 
is determined by the application.) The following three scatter plots demonstrate this correla- 
tion. 
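The relation between step size and the size of changes in the objective function can be sketched as follows. The setup is again an assumed toy version of the subset-sum problem; neighbour_k is a hypothetical variant of the neighbourhood function that takes the step size as an explicit argument.

```r
## toy subset-sum setup (assumptions for illustration)
set.seed(7)
Data <- list(X = runif(100), n = 100L, s0 = 2)
OF <- function(x, Data) abs(sum(Data$X[x]) - Data$s0)
randomSol <- function(Data) runif(Data$n) > 0.5

## hypothetical neighbourhood with explicit step size k
neighbour_k <- function(x, Data, k) {
    p <- sample.int(Data$n, size = k)
    x[p] <- !x[p]
    x
}

## changes in objective function value for steps of 1, 5 and 10 elements
trials <- 1000L
diffs <- sapply(c(1L, 5L, 10L), function(k)
    replicate(trials, {
        x <- randomSol(Data)
        OF(neighbour_k(x, Data, k), Data) - OF(x, Data)
    }))
colnames(diffs) <- c("1", "5", "10")

## average absolute change per step size
spread <- apply(diffs, 2, function(d) mean(abs(d)))
spread
```

Larger steps should produce larger typical changes, which is the pattern that the distributions in Fig. 13.4 display.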
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FIGURE 13.5 No structure in the objective function. In search spaces such as the ones pictured a Local Search will fail, 
because the objective function has no local structure, i.e. it provides no guidance. Being close to a good solution cannot be 
exploited by the algorithm. 


We can make this point clearer by looking at the opposite case: what if there is no correla- 
tion between close solutions, i.e. when there is no local structure? The following figure shows no 
correlation. See also Fig. 13.5. 
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Running a Local Search 


The NMOF package provides a function LSopt that implements a Local Search. Its interface is 
quite simple: 


LSopt(OF, algo = list(), ...)


OF is the objective function; algo is a list of settings and information required for running the
algorithm. Model-specific data, which may be required to evaluate the objective or the neighborhood
function, can be passed via the ... argument. In our case, the list Data would be passed in this
way.

A short but functional version is provided as the function LSopt., in which the . suffix serves
as a mnemonic that it is an abbreviated function.


> LSopt. <- function(OF, algo = list(), ...) {
      xc  <- algo$x0
      xcF <- OF(xc, ...)
      for (s in seq_len(algo$nS)) {
          xn  <- algo$neighbour(xc, ...)
          xnF <- OF(xn, ...)
          if (xnF <= xcF) {
              xc  <- xn
              xcF <- xnF
          }
      }
      list(xbest = xc, OFvalue = xcF)
  }


To run it, provide the objective function OF; and a list with an initial solution, the neighborhood 
function, and the number of steps nS. 


> LSopt.(OF,
         list(x0 = randomSol(Data),
              neighbour = neighbour,
              nS = 50000),
         Data = Data)$OFvalue


[1] 0.0106 


The function LSopt. has the same interface as and is fully compatible with LSopt as provided
by NMOF, i.e. you may rerun the examples with LSopt as provided by NMOF. (The other way
around will not work, as LSopt has many more options.)

Local Search still depends on the starting value. However, running the method twice with the 
same starting value may result in different solutions. We demonstrate this by running the method 
two times: 


> library("NMOF")
> x0 <- randomSol(Data)
> algo <- list(x0 = x0,
               neighbour = neighbour,
               printBar = FALSE,
               nS = 50000)
> sol1 <- LSopt(OF, algo, Data = Data)


Local Search. 

Initial solution: 19.6 
Finished. 

Best solution overall: 0.0133 


> sol2 <- LSopt(OF, algo, Data = Data)


Local Search. 

Initial solution: 19.6 
Finished. 

Best solution overall: 0.00948 


Note that we used the LSopt function here. It returns a list, which contains a matrix Fmat. 
This matrix always has two columns and as many rows as there were iterations. 


> dim(sol1$Fmat)


[1] 50000 2 


The first column contains the objective function values of the proposed solutions over all
iterations; the second column contains those of the accepted solutions. Thus the second column
shows us the progress of the search, which we plot in Fig. 13.6.

We shall later analyze these results in more detail. But now, let us move right on to the next 
heuristic, one that builds on Local Search. 


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FIGURE 13.6 Objective function value for two runs of Local Search over 50,000 iterations (log scale). 


13.4.3 Threshold Accepting 


Threshold Accepting makes only a small modification to LS. It changes the acceptance rule A: if
the new solution is better, accept it; but if it is worse, do still accept it, as long as it is not too bad. In
fact, accept it as long as its badness does not exceed a certain threshold, which is why the method
is called Threshold Accepting.

Introducing such thresholds may be a small change, but it is fundamental, because we now 
accept solutions that are worse than the previously best one. This behavior forces us to adapt our 
algorithm, since we now need to keep track of the best solution found: 


generate initial solution x^c
set overall-best solution x* = x^c
while stopping condition not met do
    create new solution x^n = N(x^c)
    if A(x^n, f(x^n), f(x^c)) then
        x^c = x^n
        update overall-best solution x*
    end if
end while
return x*


In the implementation we will typically use a sequence of decreasing thresholds: initially,
we allow the algorithm to move freely; later, it is forced to become stricter.

Threshold Accepting is implemented in NMOF in the function TAopt, whose interface is the 
same as that of LSopt. 


TAopt(OF, algo = list(), ...)


We also provide an abbreviated version. 


> TAopt. <- function(OF, algo = list(), ...) {
      xbest  <- xc  <- algo$x0
      xbestF <- xcF <- OF(xc, ...)
      for (t in seq_along(algo$vT)) {
          for (s in seq_len(algo$nS)) {
              xn  <- algo$neighbour(xc, ...)
              xnF <- OF(xn, ...)
              if (xnF <= xcF + algo$vT[t]) {
                  xc  <- xn
                  xcF <- xnF
                  if (xnF <= xbestF) {
                      xbest  <- xn
                      xbestF <- xnF
                  }
              }
          }
      }
      list(xbest = xbest, OFvalue = xbestF)
  }


To run TAopt., we need to provide several pieces of data. One of those pieces is the set of actual
threshold values. For intuition, suppose the thresholds were very large; then, our optimization
algorithm would become a random walk and any new solution would be accepted. On the other hand,
if the thresholds were too small, the algorithm would be too restrictive, and become stuck in local
minima; for zero thresholds, we get exactly a Local Search. So the thresholds need to be connected
to the step size for the algorithm, which is defined by the neighborhood.

We know of what magnitude typical changes in the objective function may be, because we know
all numbers in X are uniformly distributed between 0 and 1; so let us fix some reasonable threshold
values.


> algo <- list(x0 = randomSol(Data),
               neighbour = neighbour,
               nS = 50000,
               vT = c(0.1, 0.02, 0),
               printBar = FALSE,
               printDetail = FALSE)
> TAopt.(OF, algo = algo, Data = Data)$OFvalue


[1] 0.00391


TAopt offers an option thresholds.only which will only compute thresholds, but not run
the actual optimization.


> algo$vT <- NULL
> algo$nT <- 10
> algo$thresholds.only <- TRUE
> thresholds <- TAopt(OF, algo = algo, Data = Data)
> str(thresholds)

List of 7
 $ xbest        : logi NA
 $ OFvalue      : logi NA
 $ Fmat         : logi NA
 $ xlist        : logi NA
 $ vT           : Named num [1:10] 0.895 0.772 0.654 ...
  ..- attr(*, "names")= chr [1:10] "45%" "40%" "35%" ...
 $ initial.state: int [1:626] 403 366 648723493 -1416 ...
 $ x0           : logi [1:100] TRUE TRUE FALSE FALSE ...


> algo$vT <- thresholds$vT
> algo$thresholds.only <- FALSE
> (TAopt(OF, algo = algo, Data = Data)$OFvalue)


[1] 0.00117
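The remark that zero thresholds turn Threshold Accepting into a Local Search can be checked numerically. The sketch below is self-contained and uses assumed toy definitions (Data, OF, randomSol, neighbour) together with two compact functions, LS_sketch and TA_sketch, which are hypothetical stand-ins written in the spirit of the abbreviated functions, not the NMOF implementations.

```r
## toy subset-sum setup (assumptions for illustration)
set.seed(99)
Data <- list(X = runif(100), n = 100L, s0 = 2, stepsize = 1L)
OF <- function(x, Data) abs(sum(Data$X[x]) - Data$s0)
randomSol <- function(Data) runif(Data$n) > 0.5
neighbour <- function(x, Data) {
    p <- sample.int(Data$n, size = Data$stepsize)
    x[p] <- !x[p]
    x
}

## compact Local Search
LS_sketch <- function(OF, algo, ...) {
    xc <- algo$x0
    xcF <- OF(xc, ...)
    for (s in seq_len(algo$nS)) {
        xn <- algo$neighbour(xc, ...)
        xnF <- OF(xn, ...)
        if (xnF <= xcF) {
            xc <- xn
            xcF <- xnF
        }
    }
    list(xbest = xc, OFvalue = xcF)
}

## compact Threshold Accepting
TA_sketch <- function(OF, algo, ...) {
    xbest <- xc <- algo$x0
    xbestF <- xcF <- OF(xc, ...)
    for (t in seq_along(algo$vT)) {
        for (s in seq_len(algo$nS)) {
            xn <- algo$neighbour(xc, ...)
            xnF <- OF(xn, ...)
            if (xnF <= xcF + algo$vT[t]) {
                xc <- xn
                xcF <- xnF
                if (xnF <= xbestF) {
                    xbest <- xn
                    xbestF <- xnF
                }
            }
        }
    }
    list(xbest = xbest, OFvalue = xbestF)
}

## a single threshold of zero; same seed for both runs
algo <- list(x0 = randomSol(Data), neighbour = neighbour,
             nS = 5000L, vT = 0)
set.seed(1)
res_ls <- LS_sketch(OF, algo, Data = Data)
set.seed(1)
res_ta <- TA_sketch(OF, algo, Data = Data)

c(LS = res_ls$OFvalue, TA = res_ta$OFvalue)
```

With the same seed, both functions consume the same random numbers and accept the same candidates, so the results coincide.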
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Example 13.3 


Updating the objective function 

There is a story (or legend) that the German mathematician Gauss, when he was in elementary 
school, was given the task to add the integers from 1 to 100. Being the fantastic mathematician that he 
was, Gauss came up quickly with the answer: 5050. He had realized that 1 + 100 = 101, 2+ 99 = 101, 
3 +98 = 101, and so on. Thus, 50 x 101 gave the answer. 

What Gauss had done was to exploit the structure in the problem, which allowed him to compute 
its solution more speedily. We can do the same here. Recall how both Local Search and TA handle the 
problem: a solution x is passed to the objective function, which evaluates to a number. The neighbor- 
hood function takes a solution x and returns a new solution. Since we provide both functions, we may 
as well change what x is. So far, a solution was a logical vector, but we can as well make it a bundle of 


• a logical vector, and
• the subset sum associated with this vector.


What is the advantage? In iteration 1, we compute the sum of the whole initial subset. But in the 
second iteration, we change only a single element. Thus, the sum remains almost the same; we simply 
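have to add or subtract a single number. Define a vector $\iota$ of length n whose elements are defined as
follows:

$$\iota_j = \begin{cases} \phantom{-}0 & \text{if unchanged,} \\ \phantom{-}1 & \text{if added,} \\ -1 & \text{if removed.} \end{cases}$$

In iteration 1, we compute

$$S_1 = \sum_{j \,\in\, \text{subset}_1} X_j \,;$$

in iteration 2, we compute

$$S_2 = S_1 + X'\iota \,;$$

and in iteration i, we compute

$$S_i = S_{i-1} + X'\iota \,.$$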


Let us try it in R. A random solution could now be created as follows. 


> tmp <- randomSol(Data)
> x0 <- list(x = tmp,
             sx = sum(Data$X[tmp]))


The new objective function does not sum, but merely extracts the sum from the solution. 


> OF2 <- function(x, Data)
      abs(x$sx - Data$s0)


> OF2(x0, Data)


[1] 25.1


> OF(x0$x, Data) ## check


[1] 25.1


Instead, the neighborhood function does the summing. 


> neighbour2 <- function(x, Data) {
      p <- sample.int(Data$n, size = Data$stepsize)
      x$x[p] <- !x$x[p]
      x$sx <- x$sx + sum(Data$X[p] * (2 * x$x[p] - 1))
      x
  }
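A quick way to gain confidence in such an updating scheme is to compare the updated sum with a sum recomputed from scratch after many steps. The sketch below is self-contained; the toy Data list and the number of steps are assumptions made for illustration.

```r
## toy setup (assumptions for illustration)
set.seed(123)
Data <- list(X = runif(1000), n = 1000L, stepsize = 3L)

## neighbourhood function that updates the stored subset sum
neighbour2 <- function(x, Data) {
    p <- sample.int(Data$n, size = Data$stepsize)
    x$x[p] <- !x$x[p]
    ## after the flip: TRUE means added (+1), FALSE means removed (-1)
    x$sx <- x$sx + sum(Data$X[p] * (2 * x$x[p] - 1))
    x
}

## a random solution, bundled with its subset sum
tmp <- runif(Data$n) > 0.5
x <- list(x = tmp, sx = sum(Data$X[tmp]))

## take many steps, updating the sum along the way
for (i in 1:500)
    x <- neighbour2(x, Data)

c(updated = x$sx, recomputed = sum(Data$X[x$x]))
```

Both numbers should agree up to floating-point error. In a very long run, small rounding errors may accumulate in the updated sum, so an occasional full recomputation can serve as a safeguard.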


The updating mechanism itself comes with a certain overhead. So let us create a larger example to show 
its benefit. 


> Data$n <- 10000L
> Data$X <- rnorm(Data$n)


> set.seed(56447)
> x0 <- randomSol(Data)
> algo <- list(x0 = x0,
               printDetail = FALSE, printBar = FALSE,
               neighbour = neighbour)
> sol1 <- TAopt(OF, algo = algo, Data = Data)
> set.seed(56447)
> tmp <- randomSol(Data)
> x0 <- list(x = tmp, sx = sum(Data$X[tmp]))
> algo <- list(x0 = x0,
               printDetail = FALSE, printBar = FALSE,
               neighbour = neighbour2)
> sol2 <- TAopt(OF2, algo = algo, Data = Data)


Both objective functions should agree on the quality of the solution. 


> OF(sol1$xbest, Data)


[1] 0.00106 


> OF2(sol2$xbest, Data) 


[1] 0.00106 
The speedup. 


> full_sum <- expression({
      x0 <- randomSol(Data)
      algo <- list(x0 = x0,
                   printDetail = FALSE, printBar = FALSE,
                   neighbour = neighbour)
      TAopt(OF, algo = algo, Data = Data)
  })


> updating <- expression({
      tmp <- randomSol(Data)
      x0 <- list(x = tmp, sx = sum(Data$X[tmp]))
      algo <- list(x0 = x0,
                   printDetail = FALSE, printBar = FALSE,
                   neighbour = neighbour2)
      TAopt(OF2, algo = algo, Data = Data)
  })


> library("rbenchmark")
> benchmark(full_sum,
            updating,
            replications = 5, order = "relative")[, 1:4]

      test replications elapsed relative
2 updating            5   0.908     1.00
1 full_sum            5   3.568     3.93

13.4.4 Settings, or: how (long) to run an algorithm 


In all of the above examples, we just ran the algorithms and they somehow worked. How did we 
know the settings we have chosen, notably the number of iterations, were appropriate? That is easy 
to answer: we did not. As it turns out, the specific settings and choices—such as how a solution is 
represented, or with how many iterations an algorithm is run—matter a lot for heuristics. 

Unfortunately, there is no way to configure a general-purpose heuristic such that it works out- 
of-the-box for all cases. (Which may not be true for a specialized implementation.) The functions 
in NMOF, such as TAopt, come with default settings. But these are provided mainly to make it 
easier to get started with these methods, not to use them “as is”. It is easier to modify—to play 
around with—a program that actually runs. 

But not knowing the appropriate settings is not a problem: one of the principles we stated in
Chapter 1 was to “go experiment”. Users simply need to give up the idea that they can run a method
once with default settings and be done. Rather, users need to run small-scale experiments to see
what settings are appropriate.

The most important question is: how long to run the algorithm? We cannot use the rules of clas- 
sical optimization. Even if there was no improvement in the best objective function value for quite 
some time, we could not be sure that the algorithm should be stopped: heuristics were specially 
designed to be able to walk away from local minima, and this may take time. 

(There are exceptional cases: if the objective function is some kind of distance for which the
best possible value is zero, we can stop the algorithm when we are at zero. We have such a case in
this chapter.⁶)

Running experiments is quite simple. We set up the problem, fix the number of iterations to
I_1, and run the algorithm 20 times, say. For each run, we store the solutions. Now we increase
the iterations to I_2, and again run the algorithm 20 times. For each setting, we check the obtained
solutions (e.g., look at the distribution function of the objective function values). This is not a
formal method; but it gives us an idea what the algorithm gives us for specific settings, and thus we
can easily find settings that are good enough. John Tukey is often quoted for having said something
like “the best thing is when the data give you so clear a message that you cannot ignore it.” In our
experience, this is exactly what happens when you run such experiments; furthermore, the obtained
settings typically work well for the given model even when used with new data.
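Such an experiment could be sketched as follows. Everything in the block is an assumption made for illustration: the toy subset-sum setup, the helper run_LS (a bare-bones Local Search that returns only the final objective function value) and the three iteration counts.

```r
## toy subset-sum setup (assumptions for illustration)
set.seed(2024)
Data <- list(X = runif(100), n = 100L, s0 = 2, stepsize = 1L)
OF <- function(x, Data) abs(sum(Data$X[x]) - Data$s0)
randomSol <- function(Data) runif(Data$n) > 0.5
neighbour <- function(x, Data) {
    p <- sample.int(Data$n, size = Data$stepsize)
    x[p] <- !x[p]
    x
}

## hypothetical helper: one Local Search run with nS steps
run_LS <- function(nS, Data) {
    xc <- randomSol(Data)
    xcF <- OF(xc, Data)
    for (s in seq_len(nS)) {
        xn <- neighbour(xc, Data)
        xnF <- OF(xn, Data)
        if (xnF <= xcF) {
            xc <- xn
            xcF <- xnF
        }
    }
    xcF
}

## 20 runs for each number of iterations
settings <- c(I1 = 100L, I2 = 1000L, I3 = 10000L)
results <- sapply(settings,
                  function(nS) replicate(20L, run_LS(nS, Data)))

## minimum, median and maximum per setting
t(apply(results, 2, quantile, probs = c(0, 0.5, 1)))
```

Comparing the rows of the resulting table (or plotting the empirical distribution functions of the columns of results) indicates at which point additional iterations stop paying off for this particular problem.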


13.4.5 Stochastics of LS and TA 


Local Search and Threshold Accepting are stochastic methods. Thus running either method twice 
will likely result in two different solutions. Think of the result of the optimization as the realization 
of a random variable with an unknown distribution D. What exactly is the result of the optimiza- 
tion? The decision variables that our software returns. To keep it simple, in this chapter we only 
look at the objective function values that are associated with such a solution. 

Now, we do not know what D looks like, but that is no problem, as it is easy to sample from
it. In fact, you have already seen distributions of samples: in Fig. 13.2, for example, we plotted the
outcomes of repeated starts of greedy search. Now, we run such restarts i = 1, ..., n_restarts for LS
and TA, and each time collect the objective function value f_i associated with the solution. The
NMOF package provides a function restartOpt to help here.


6. Sometimes the mechanics of a technique tell us that we can stop the algorithm. For instance, in the standard version of 
Differential Evolution, new solutions are linear combinations of old solutions. Thus, if the population has converged such 
that all members are the same, we can stop. 
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restartOpt(fun, n, OF, algo, ..., 
method = c("loop", "multicore", "snow"), 
mc.control = list(), cl = NULL) 


Here is an example how we could call it for both LS and TA. 


> library("parallel")
> cl <- makePSOCKcluster(10) ## set number of cores
> clusterSetRNGStream(cl, 42858)
> ##
> algo <- list(neighbour = neighbour,
               x0 = randomSol(Data),
               nI = 200000,
               printBar = FALSE,
               printDetail = FALSE)
> sols.LS <- restartOpt(LSopt, n = 100, OF,
                        algo = algo, Data = Data, cl = cl)
> sols.TA <- restartOpt(TAopt, n = 100, OF,
                        algo = algo, Data = Data, cl = cl)
> stopCluster(cl)

We extract the objective function values. 


> sols.LS <- sapply(sols.LS, `[[`, "OFvalue")
> sols.TA <- sapply(sols.TA, `[[`, "OFvalue")


We summarize the objective function values. We also add summaries for the results of greedy 
search and the random solutions. 


> summary(greedy.OF)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.00000 0.00042 0.00075 0.00199 0.00198 0.01641

> summary(random.OF)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
0.000003 0.000446 0.001090 0.000961 0.001381 0.001950

> summary(sols.LS)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
0.00e+00 4.40e-06 1.05e-05 1.95e-05 2.45e-05 1.06e-04

> summary(sols.TA)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
0.00e+00 5.10e-06 1.10e-05 1.69e-05 2.17e-05 8.34e-05


Note that the best solutions are somewhat similar. But the results for LS and TA are much more 
consistent. 

We also plot the distributions in Fig. 13.7. Suppose you had been tasked with solving the subset- 
sum problem, and need to choose a solution strategy. Essentially, obtaining a solution means you 
draw from the distributions in Fig. 13.7. We see that the solutions obtained by the heuristics are the 
best ones. 


[Figure 13.7: empirical distribution functions (vertical axis from 0.0 to 1.0) of objective function values (horizontal axis from 0.0000 to 0.0020). Legend: Greedy search, Random search, Local Search, Threshold Accepting.]


FIGURE 13.7 Distributions of objective function values of best 100 random solutions, 100 greedy solutions, 100 solutions 
obtained by Local Search and 100 solutions obtained by Threshold Accepting. Best possible solution would be zero. 


> min(greedy.OF) 

[1] 1.62e-07 

> min(random.OF)

[1] 3.36e-06 

> min(sols.LS) 

[1] 1.32e-08 

> min(sols.TA) 

[1] 9.09e-09 

13.5 Application: selecting variables in a regression 


13.5.1 Linear models 


In this section we do a simple model selection for a linear regression:⁷ out of p available
regressors, select a subset such that a given selection criterion is minimized. We start with a function
randomData; it creates a dataset X of p available regressors with n observations. A number k of
these regressors are the ‘true’ regressors, and they define a response variable y:


$$y = X_K \beta + s\,\epsilon \qquad (13.2)$$


The variable K is the set of true regressors (i.e. k == length(K)); thus, $X_K$ are those columns
of X that represent the true regressors. The number s scales the residuals.


> randomData <-
      function(p = 200L,      ## number of available regressors
               n = 200L,      ## number of observations
               maxReg = 10L,  ## max. number of
                              ## included regressors
               s = 1,         ## standard deviation of residuals
               constant = TRUE) {

      X <- array(rnorm(n * p), dim = c(n, p))
      if (constant)
          X[, 1L] <- 1

      k <- sample.int(maxReg, 1L)   ## number of true regressors
      K <- sort(sample.int(p, k))   ## set of true regressors
      betatrue <- rnorm(k)          ## true coefficients

      ## the response variable y
      y <- X[, K] %*% as.matrix(betatrue) + rnorm(n, sd = s)

      list(X = X, y = y,
           betatrue = betatrue,
           K = K, n = n, p = p)
  }


7. This example is taken from Schumann (2011–2018a). We thank Victor Bystrov for comments on an earlier (MATLAB)
version of this example.


We create a random dataset.

> rD <- randomData(p = 100L, n = 200L, s = 1,
                   constant = TRUE, maxReg = 10L)

We put all the data in a list called Data.

> Data <- list(X = rD$X,
               y = rD$y,
               n = rD$n,
               p = rD$p,
               maxk = 30L,   ## max. number of regressors in model
               lognn = log(rD$n)/rD$n)


Next, we compute a random solution x0.


> x0 <- logical(Data$p)
> temp <- sample.int(Data$maxk, 1L)
> temp <- sample.int(Data$p, temp)
> x0[temp] <- TRUE


Such a solution is a logical vector of length p which can be used to subset the columns of X. 
Clearly, x0 is not going to be a particularly good solution. But it will help us to test the code and 


demonstrate how it works. 
The true regressors ...


> rD$K


[1] 15 38 40 55 59 61 79 93 94

... and the random solution.


> which(x0)


 [1]  2  4  6  8 14 15 19 20 24 35 39 46 47 48 51 54 56
[18] 63 65 69 70 71 79 80 81 82 86 87 92 98


13.5.2 Fast least squares



Any selection rule for a model will use the residuals of the fitted model as an ingredient. Thus,
given a potential solution, we will have to compute a fit. Here we use Least Squares. Typically we
would use lm for this. But lm computes a lot of things that we actually do not need: we only need
the fitted coefficients to compute the residuals. So instead of lm, we use the function .lm.fit.
As a test, we compute the coefficients for the random solution x0. We also include a computation
that uses the QR decomposition directly.


> result1 <- lm(Data$y ~ -1 + Data$X[, x0])
> result2 <- qr.solve(Data$X[, x0], Data$y)
> result3 <- .lm.fit(Data$X[, x0, drop = FALSE], Data$y)
> ## ... coefficients should be the same
> all.equal(as.numeric(coef(result1)),
            as.numeric(result2))


[1] TRUE


> all.equal(as.numeric(coef(result1)),
            as.numeric(coef(result3)))


[1] TRUE


A timing test. 


> require("rbenchmark")
> benchmark(.lm.fit(Data$X[, x0, drop = FALSE], Data$y),
            lm(Data$y ~ -1 + Data$X[, x0]),
            qr.solve(Data$X[, x0], Data$y),
            columns = c("test", "elapsed", "relative"),
            order = "relative",
            replications = 1000L)


                                          test elapsed relative
1 .lm.fit(Data$X[, x0, drop = FALSE], Data$y)   0.078     1.00
3              qr.solve(Data$X[, x0], Data$y)   0.110     1.41
2              lm(Data$y ~ -1 + Data$X[, x0])   0.600     7.69
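Since the selection criterion only needs residuals, it may be reassuring to see that the three approaches also agree on those. The sketch below uses small simulated data; the dimensions and the seed are arbitrary choices made for illustration.

```r
## small simulated regression problem (sizes chosen for illustration)
set.seed(10)
n <- 200L
p <- 5L
X <- array(rnorm(n * p), dim = c(n, p))
y <- drop(X %*% rnorm(p)) + rnorm(n)

## residuals computed in three ways
e1 <- residuals(lm(y ~ -1 + X))     ## full-featured lm
e2 <- .lm.fit(X, y)$residuals       ## bare-bones fitting function
e3 <- y - X %*% qr.solve(X, y)      ## QR solve, then residuals by hand

all.equal(as.numeric(e1), as.numeric(e2))
all.equal(as.numeric(e1), as.numeric(e3))
```

Both comparisons should return TRUE, so for the purpose of computing residuals the faster function can stand in for lm.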


13.5.3 Selection criterion 


Now, for the actual selection criterion. We will use the Schwarz criterion, which is (for a linear 
model) given by 


$$\log\left(\frac{e'e}{n}\right) + \frac{\log(n)}{n} \times \text{number of regressors}\,, \qquad (13.3)$$

where e is the vector of residuals;


see for instance Johnston and DiNardo (1997). We put this computation in the objective function 
OF. 
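The criterion as written above differs from what R's BIC function reports for a fitted lm model, but, as far as model ranking is concerned, only by a rescaling and a constant. The sketch below checks this on simulated data: BIC divided by n, minus the criterion, should equal log(2π) + 1 + log(n)/n for every model fitted on the same n observations. The helper names schwarz and bic_per_obs are hypothetical and introduced for illustration only.

```r
## simulated data (sizes chosen for illustration)
set.seed(11)
n <- 200L
X <- array(rnorm(n * 6L), dim = c(n, 6L))
y <- drop(X[, 1:2] %*% c(1, -1)) + rnorm(n)

## criterion as in the text: log(e'e/n) + log(n)/n * number of regressors
schwarz <- function(cols) {
    e <- .lm.fit(X[, cols, drop = FALSE], y)$residuals
    log(sum(e^2) / n) + length(cols) * log(n) / n
}

## R's BIC for the same model, per observation
bic_per_obs <- function(cols)
    BIC(lm(y ~ -1 + X[, cols, drop = FALSE])) / n

## the gap is the same for a small and a large model
gap_small <- bic_per_obs(1:2) - schwarz(1:2)
gap_large <- bic_per_obs(1:6) - schwarz(1:6)
c(gap_small, gap_large, log(2 * pi) + 1 + log(n) / n)
```

Because the gap does not depend on the chosen regressors, minimizing either quantity selects the same model.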


> OF <- function(x, Data) {
      e <- .lm.fit(Data$X[, x, drop = FALSE], Data$y)$residuals
      log(crossprod(e)/Data$n) + sum(x) * Data$lognn
  }


With the random solution.

> OF(x0, Data)


The final ingredient that we need is a neighborhood function. It randomly chooses one element of 
a solution and switches its value, as we have seen before in the subset-sum problem. Note that the 
neighborhood function includes the constraints: we reject solutions that include no or more than 
Data$maxk regressors. 


> neighbour <- function(xc, Data) {
      xn <- xc
      ex <- sample.int(Data$p, 1L)
      xn[ex] <- !xn[ex]
      sumx <- sum(xn)
      if (sumx < 1L || sumx > Data$maxk)
          xc
      else
          xn
  }
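That the neighborhood function enforces the constraints can be verified by walking randomly through the search space and recording the model size. The sketch is self-contained; the small values for p and maxk are assumptions chosen so that the bounds are hit often.

```r
## toy dimensions (assumptions; chosen so that both bounds bind)
set.seed(5)
Data <- list(p = 20L, maxk = 3L)

neighbour <- function(xc, Data) {
    xn <- xc
    ex <- sample.int(Data$p, 1L)
    xn[ex] <- !xn[ex]
    sumx <- sum(xn)
    ## keep the current solution if the candidate violates a constraint
    if (sumx < 1L || sumx > Data$maxk)
        xc
    else
        xn
}

## start with a single regressor and take many random steps
x <- logical(Data$p)
x[1L] <- TRUE
sizes <- integer(2000L)
for (i in seq_along(sizes)) {
    x <- neighbour(x, Data)
    sizes[i] <- sum(x)
}
range(sizes)
```

The number of included regressors should never fall below 1 or exceed maxk. Note the design choice: an infeasible candidate is replaced by the current solution, so the step is wasted but the algorithm never sees an invalid solution.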


Some random evaluations. 


> OF (neighbour (x0, Data), Data) 


> OF (neighbour (x0, Data), Data) 


> OF (neighbour(x0, Data), Data) 


13.5.4 Putting it all together 


We collect all settings for the algorithm, including the neighborhood function, in a list algo. Then 
we run TAopt. It should be finished pretty quickly. 


> algo <- list(nT = 10L,     ## number of thresholds
               nS = 200L,    ## number of steps per threshold
               nD = 1000L,   ## number of random steps to
                             ## compute thresholds
               neighbour = neighbour,
               x0 = x0,
               printBar = FALSE)
> sol1 <- TAopt(OF, algo = algo, Data = Data)


Threshold Accepting


Computing thresholds ... OK
Estimated remaining running time: 0.192 secs


Running Threshold Accepting
Initial solution: 2.88
Finished.

Best solution overall: 0.295


We check the resulting solution’s objective function value sol1$OFvalue, and we compare 
the selected regressors with the true regressors. 


> sol1$OFvalue


      [,1]
[1,] 0.295


> which(sol1$xbest) ## the selected regressors


 [1] 15 33 38 40 55 59 61 79 93 94


> rD$K ## the true regressors


[1] 15 38 40 55 59 61 79 93 94


They are not the same. But in a relatively small sample we should actually not expect this to be the 
case. (You can increase n to see if the true model is eventually identified.) In fact, we can compare 
the value of the objective function for the true model and the selected model. 


> xtrue <- logical(Data$p)
> xtrue[rD$K] <- TRUE


> OF(sol1$xbest, Data)

      [,1]
[1,] 0.295

> OF(xtrue, Data)

      [,1]
[1,] 0.307


We see that the Schwarz criterion for our selected model is lower than for the true model. 
Finally, we run a small experiment. Note that all runs use the same starting value x0. 


> restarts <- 100L
> algo$printDetail <- FALSE
> res <- restartOpt(TAopt,
                    n = restarts,
                    OF,
                    algo = algo, Data = Data)
> ## extract solution quality and plot cdf
> plot(ecdf(sapply(res, `[[`, "OFvalue")),
       cex = 0.4, main = "", ylab = "", xlab = "",
       verticals = TRUE)


[Figure: empirical distribution function (vertical axis from 0.0 to 1.0) of the objective function values obtained in the restarts (horizontal axis from 0.28 to 0.31).]


For each solution, we compute the objective function value, and also the selected regressors. 


> ## extract all solutions
> xbestAll <- sapply(res, `[[`, "xbest")
> ## get included regressors
> inclReg <- which(rowSums(xbestAll) > 0L)
> inclReg <- sort(union(rD$K, inclReg))
> data.frame(regressor = inclReg,
             `included` = paste0(rowSums(xbestAll)[inclReg],
                                 "/", restarts),
             `true regressor?` = inclReg %in% rD$K,
             check.names = FALSE)

   regressor included true regressor?
1         15  100/100            TRUE
2         33  100/100           FALSE
3         38  100/100            TRUE
4         40  100/100            TRUE
5         55  100/100            TRUE
6         59  100/100            TRUE
7         61  100/100            TRUE
8         79  100/100            TRUE
9         93  100/100            TRUE
10        94  100/100            TRUE


Across the restarts, we get a relatively clear answer which regressors should, according to the 
Schwarz criterion, be put into the model. 


13.6 Application: portfolio selection 


In this section, we shall sketch how the code we wrote so far can be used to solve portfolio selection 
models. Asset selection with Local Search is already discussed in Section 14.3.1. So we rather show 
how TA may be used for continuous portfolio selection models. We also show that other features 
that we used in the subset-sum problem such as updating the objective function may be used in this 
application. 


13.6.1 Models 


A simple model is to minimize a performance measure, subject to a budget constraint, and minimum 
and maximum holding sizes: 


w is a weight vector that needs to sum to unity (the budget constraint). We set the maximum weight
to 5% for every asset.
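In symbols, a model of the kind just described (minimize a performance measure f, subject to a budget constraint and holding-size limits) could be written as follows. This is a sketch under assumptions: the symbols $w^{\min}$ and $w^{\max}$ for the minimum and maximum holding sizes, and the index j running over the assets, are notation introduced here for illustration.

```latex
\min_{w} \; f(w)
\quad \text{subject to} \quad
\sum_{j} w_j = 1 ,
\qquad
w^{\min} \le w_j \le w^{\max} \quad \text{for all assets } j .
```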

For f we use a simple measure of risk: squared portfolio return. You may be more used to 
minimizing variance, but actually squared return and variance are close relatives, as the following 
relation shows: 


$$\frac{1}{n_S}\, R'R = \operatorname{Cov}(R) + m m' \,.$$


In the equation, R is a matrix with $n_A$ columns and $n_S$ rows. The vector m holds the column
means of R.
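The relation can be verified numerically. In the sketch below, the scenario matrix is simulated (its dimensions, mean and volatility are arbitrary assumptions). Note that R's cov function divides by the number of rows minus one, so it is rescaled to the version that divides by the number of rows, which is what the identity requires.

```r
## simulated scenario matrix: ns scenarios (rows), na assets (columns)
set.seed(3)
ns <- 500L
na <- 4L
R <- array(rnorm(ns * na, mean = 0.001, sd = 0.02), dim = c(ns, na))

m <- colMeans(R)

## left-hand side: matrix of average cross-products
lhs <- crossprod(R) / ns

## right-hand side: covariance (divisor ns) plus outer product of means
rhs <- cov(R) * (ns - 1) / ns + tcrossprod(m)

all.equal(lhs, rhs)
```

The comparison should return TRUE. When mean returns are small relative to volatility, as is typical for weekly data, the term mm' is tiny and squared return and variance nearly coincide.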

Suppose you wanted to solve this model with a standard mean-variance (i.e. a QP) solver. In a 
nutshell, this is what mean-variance optimization does: 


weights + returns            →  portfolio return
   w         m                      m'w


weights + covariance matrix  →  portfolio variance
   w         Σ                      w'Σw


This is a powerful idea: we aggregate information about single assets into information about the 
portfolio. But unfortunately, this idea cannot be applied to other models—it is rather special to 
mean-variance. So instead we are going to use a much more flexible framework: scenario opti- 
mization. We now consider R a scenario matrix: in its rows it stores the scenarios; in its columns, 
the assets. 


weights + scenarios  →  portfolio returns  →  any portfolio statistic
   w         R               Rw                   f(Rw)


Note that this approach nests mean-variance: if we minimized variance in this approach, the 
solution would be the same as with the mean-variance approach. 
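That the scenario approach nests mean-variance can be seen in a small numerical sketch: the variance of the portfolio's scenario returns Rw equals w'Σw when Σ is the covariance matrix of the scenarios. The simulated scenario matrix and the equal weights are assumptions for illustration; the last lines show a statistic that has no counterpart in the mean-variance setting.

```r
## simulated scenarios: 500 scenarios (rows), 4 assets (columns)
set.seed(4)
R <- array(rnorm(500 * 4, sd = 0.02), dim = c(500, 4))
w <- rep(0.25, 4)                ## equal weights

## mean-variance route: aggregate via the covariance matrix
Sigma <- cov(R)
via_cov <- drop(t(w) %*% Sigma %*% w)

## scenario route: compute portfolio returns first, then the statistic
Rw <- drop(R %*% w)
via_scenarios <- var(Rw)

all.equal(via_cov, via_scenarios)

## the scenario route works for any statistic of Rw,
## e.g. the 5% quantile of portfolio returns
quantile(Rw, 0.05)
```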


13.6.2 Local-Search algorithms 


Recall what we need to run a local-search algorithm: 


a representation of a solution Straightforward: a numeric vector w, the portfolio weights. 

an objective function Any function f (Rw). Note that we first compute a set of portfolio returns, 
and for such univariate returns we may easily evaluate any objective function, e.g. one based 
on drawdown, partial moments or correlation. 

a neighborhood function We use a variation of the neighbour function: pick two assets; 
increase one weight, and decrease the other weight by the same amount. In this way, 
as long as we start with a valid solution, the budget constraint is automatically en- 
forced. 

1: set ε
2: randomly select asset i
3: set w_i = w_i − ε
4: randomly select asset j
5: set w_j = w_j + ε

It is also easy to include minimum and maximum weights directly in the neighborhood func- 
tion, as you will see. 


We also listed two more decisions to make: the acceptance criterion for new solutions, and the
stopping criterion. The acceptance criteria are given by the method we use: strict for Local Search;
less strict for Threshold Accepting. The stopping criterion is, for both algorithms, a fixed number
of iterations.

The NMOF package comes with a dataset fundData, which provides 500 weekly return sce- 
narios for 200 funds. 


> dim(fundData)

[1] 500 200

> summary(apply(fundData, 2, sd) * sqrt(52))

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.068   0.135   0.210   0.207   0.265   0.538


As before, we collect all information in a list Data.


> Data <- list(
      R    = t(fundData),
      na   = dim(fundData)[2L],   ## number of assets
      ns   = dim(fundData)[1L],   ## number of scenarios
      eps  = 0.5/100,             ## stepsize
      wmin = 0.00,
      wmax = 0.05,
      resample = function(x, ...)
          x[sample.int(length(x), ...)])


The objective function needs to do two things: compute Rw, and then evaluate f(Rw). We use
the function crossprod.


> OF <- function(w, Data) {
      Rw <- crossprod(Data$R, w)
      crossprod(Rw)
  }


Now the key part: the neighborhood.


> neighbour <- function(w, Data) {
      toSell <- w > Data$wmin
      toBuy  <- w < Data$wmax
      i <- Data$resample(which(toSell), size = 1L)
      j <- Data$resample(which(toBuy),  size = 1L)
      eps <- runif(1L) * Data$eps
      eps <- min(w[i] - Data$wmin,
                 Data$wmax - w[j],
                 eps)
      w[i] <- w[i] - eps
      w[j] <- w[j] + eps
      w
  }

It remains to run TAopt.


> w0 <- runif(Data$na) ## a random solution
> w0 <- w0/sum(w0)
> algo <- list(x0 = w0,
               neighbour = neighbour,
               nS = 5000L,
               nT = 10L,
               q = 0.10,
               printBar = FALSE)
> res <- TAopt(OF, algo, Data)
Threshold Accepting


Computing thresholds ... OK 
Estimated remaining running time: 10 secs 


Running Threshold Accepting 
Initial solution: 0.231 
Finished. 

Best solution overall: 0.00566 


We check the constraints. 


> min(res$xbest) ## should not be smaller than Data$wmin
[1] 0
> max(res$xbest) ## should not be greater than Data$wmax
[1] 0.05
> sum(res$xbest) ## should be 1
[1] 1


The model we have chosen can be solved with a QP solver. We did that on purpose, so that we 
could compare solution qualities. 


> library("quadprog")
> covMatrix <- crossprod(fundData)
> A <- rep(1, Data$na)
> a <- 1
> B <- rbind(-diag(Data$na),
              diag(Data$na))
> b <- rbind(array(-Data$wmax, dim = c(Data$na, 1L)),
             array( Data$wmin, dim = c(Data$na, 1L)))
> result <- solve.QP(Dmat = covMatrix,
                     dvec = rep(0, Data$na),
                     Amat = t(rbind(A, B)),
                     bvec = rbind(a, b),
                     meq  = 1L)
> wqp <- result$solution


What we may do now is compare the objective function values. The realized value of the ob- 
jective function may be hard to interpret, so we scale it: we divide by ns, take the square root, and 
multiply by 100. So we get a weekly return (see page 367 for a description of the data set). 


> c(100 * sqrt(crossprod(fundData %*% wqp)/Data$ns)) ## QP
[1] 0.336
> c(100 * sqrt(crossprod(fundData %*% res$xbest)/Data$ns)) ## TA
[1] 0.336


The portfolios are extremely similar. To show it, we define a little helper function psim. 
> psim <- function(x, y) { ## portfolio similarity
      stopifnot(length(x) == length(y))
      same.sign <- sign(x) == sign(y)
      list(same.assets = sum(same.sign),
           weight.overlap = sum(pmin(abs(x[same.sign]),
                                     abs(y[same.sign]))),
           max.abs.difference = max(abs(x - y)),
           mean.abs.difference = sum(abs(x - y))/length(x))
  }
> psim(res$xbest, wqp)
$same.assets
[1] 32

$weight.overlap
[1] 0.982

$max.abs.difference
[1] 0.00757

$mean.abs.difference
[1] 0.000185


In the subset-sum problem before, we looked into updating the objective function. We can do
the same here, because in every iteration, we update the solution and the scenario returns as follows
(see Chapter 14 for details; the quantity $w^{\Delta}$ is the (sparse) vector of changes):

$$w^{n} = w^{c} + w^{\Delta}$$
$$R w^{n} = R(w^{c} + w^{\Delta}) = \underbrace{R w^{c}}_{\text{known}} + R w^{\Delta}$$
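The saving from updating can be checked on a small example. The following is only a sketch with made-up data (the matrix R, the weights, and the step size are illustrative assumptions, not fundData): it changes two weights and confirms that adding the two affected columns to the known scenario returns gives the same result as a full recomputation.

```r
## Sketch: incremental update of scenario returns (illustrative data only)
set.seed(1)
ns <- 50L; na <- 8L
R  <- matrix(rnorm(ns * na, sd = 0.02), nrow = ns)  # scenarios in rows
wc <- rep(1 / na, na)            # current weights
Rw_c <- R %*% wc                 # known from the previous iteration

i <- 2L; j <- 5L; eps <- 0.01    # sell asset i, buy asset j
wn <- wc
wn[i] <- wn[i] - eps
wn[j] <- wn[j] + eps

## update: only two columns of R are touched
Rw_update <- Rw_c + R[, c(i, j)] %*% c(-eps, eps)

## full recomputation, for comparison
Rw_full <- R %*% wn

isTRUE(all.equal(Rw_update, Rw_full))
```

The update costs two columns per iteration instead of all $n_A$, which is why it pays off when the number of assets is large.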


The function definitions with updating: 


> OFU <- function(sol, Data)
      crossprod(sol$Rw)

> neighbourU <- function(sol, Data) {
      wn <- sol$w
      toSell <- wn > Data$wmin
      toBuy  <- wn < Data$wmax
      i <- Data$resample(which(toSell), size = 1L)
      j <- Data$resample(which(toBuy), size = 1L)
      eps <- runif(1) * Data$eps
      eps <- min(wn[i] - Data$wmin, Data$wmax - wn[j], eps)
      wn[i] <- wn[i] - eps
      wn[j] <- wn[j] + eps
      Rw <- sol$Rw + Data$R[, c(i, j)] %*% c(-eps, eps)
      list(w = wn, Rw = Rw)
  }


We run the algorithm. 


> w0 <- runif(Data$na); w0 <- w0/sum(w0) ## a random solution
> Data$R <- fundData
> sol <- list(w = w0, Rw = Data$R %*% w0)
> algo <- list(x0 = sol,
               neighbour = neighbourU,
               nS = 2000L,
               nT = 10L,
               q = 0.10,
               printBar = FALSE,
               printDetail = FALSE)
> res <- TAopt(OFU, algo, Data)


One advantage of local-search algorithms is that they work even if R does not have full column 
rank. The weight of asset 200: 


> wqp[200]
[1] 2.08e-17

It is not included in the portfolio. If we bind the final column of R to the matrix again, the
solution should not change.

> fundData <- cbind(fundData, fundData[, 200L])
> dim(fundData)
[1] 500 201
> qr(fundData)$rank
[1] 200
> qr(cov(fundData))$rank
[1] 200
> cat(try(result.QP <- solve.QP(Dmat = covMatrix,
                                dvec = rep(0, Data$na),
                                Amat = t(rbind(A, B)),
                                bvec = rbind(a, b),
                                meq  = 1L)))
Error in solve.QP(Dmat = covMatrix, dvec = rep(0, Data$na), Amat = t(rbind(A,  :
  matrix D in quadratic function is not positive definite!

> w0 <- runif(Data$na); w0 <- w0/sum(w0)
> x0 <- list(w = w0, Rw = fundData %*% w0)
> algo <- list(x0 = x0,
               neighbour = neighbourU,
               nS = 2000L,
               nT = 10L,
               nD = 5000L,
               q = 0.20,
               printBar = FALSE,
               printDetail = FALSE)
> res2 <- TAopt(OFU, algo, Data)

[1] 0.337

Weights 200 and 201:

> res2$xbest$w[200:201]
[1] 0 0


Being able to handle rank-deficient matrices is not the main advantage, though. Suppose we wanted a
different objective function, such as downside deviation:

$$\frac{1}{n_S} \sum_{r_i < 0} r_i^2$$

where the $r_i$ are the portfolio's scenario returns, measured relative to a threshold (Data$theta
in the code).

> OF <- function(w, Data) { ## semi-variance
      Rw <- crossprod(Data$R, w) - Data$theta
      Rw <- Rw - abs(Rw)
      sum(Rw*Rw) / (4 * Data$ns)
  }

Or Omega:

> OF <- function(w, Data) { ## Omega
      Rw <- crossprod(Data$R, w) - Data$theta
      -sum(Rw - abs(Rw)) / sum(Rw + abs(Rw))
  }


Simply plug these alternative objective functions into the optimization algorithm. Or suppose you
want to run the algorithm for asset allocation, and wish to have final weights that are multiples
of 5%, say. Simple: just set $\epsilon$ accordingly in the neighborhood. It is this flexibility that makes
heuristics such great tools.
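As a sketch of the last point: a neighborhood that always moves exactly one step of 5% keeps every weight a multiple of 5%, provided the starting solution already is one. The function name neighbour_step and the values in Data below are illustrative assumptions, not part of the NMOF package; the function counts steps as integers to avoid floating-point drift.

```r
## Sketch: neighbourhood with a fixed step, so weights stay multiples of 5%
## (neighbour_step and the Data values are hypothetical, for illustration)
neighbour_step <- function(w, Data) {
    k    <- round(w / Data$eps)          # weights as numbers of steps
    kmin <- round(Data$wmin / Data$eps)
    kmax <- round(Data$wmax / Data$eps)
    toSell <- which(k > kmin)
    toBuy  <- which(k < kmax)
    i <- toSell[sample.int(length(toSell), 1L)]
    j <- toBuy[sample.int(length(toBuy), 1L)]
    if (i != j) {                        # move one step from i to j
        k[i] <- k[i] - 1
        k[j] <- k[j] + 1
    }
    k * Data$eps
}

Data <- list(eps = 0.05, wmin = 0, wmax = 0.30)
set.seed(42)
w <- c(0.25, 0.25, 0.25, 0.25, 0, 0)    # valid start: multiples of 5%
for (s in 1:1000)
    w <- neighbour_step(w, Data)
w
```

The budget constraint and the holding-size limits are preserved by construction, just as in the continuous neighborhood above.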
The aim of portfolio selection is to find combinations of assets such as bonds or stocks that are
optimal with respect to a given performance measure, which may be based, for instance, on capital 
gains, volatility, or drawdowns. A portfolio selection model, in other words, is a quantitative deci- 
sion rule that tells us how to invest. In this chapter, we will discuss how to implement such models. 
In Chapter 15, we will discuss their testing. 

We start with the workhorse model for portfolio selection: mean-variance optimization 
(Markowitz, 1952). To a considerable extent, this specification is owed to computational restric- 
tions. Already in the 1950s, Markowitz pondered using downside semi-variance as a measure for 
risk, but eventually rejected it mainly because it was much more difficult to compute optimal 
portfolios. However, with heuristics, to which we turn later in this chapter, we can solve portfo- 
lio selection models without restrictions on the functional form of the selection criterion or the 
constraints. 


14.1 The investment problem 


We are endowed with an initial wealth $v_0$, and wish to select a portfolio

$$w = [w_1\ w_2\ \ldots\ w_{n_A}]'$$

of the available $n_A$ assets. The vector $w$ represents portfolio weights. When we talk about quantities,
we will call the portfolio $u$ (as in "units"). So, for example,

$$u = [20{,}000\ \ 100{,}000\ \ \ldots]'$$

means 20,000 units of the first asset, 100,000 units of the second asset, and so on. A subscript to $u$
or $w$ identifies a specific asset (e.g. $w_j$ for $j = 1, \ldots, n_A$).
The chosen portfolio is held for a specified period that starts at time 0 and ends at time $T$.
End-of-period wealth is given by

$$v_T = u'p_T = v_0\, w' \underbrace{\left(\frac{p_T}{p_0}\right)}_{\text{elementwise}} = v_0\, w'(1 + r).$$

The vectors $p_0$ and $p_T$ hold the prices at time 0 and $T$, respectively. The vector $r$ holds the realized
returns of the $n_A$ assets for the holding period:

$$r = [r_1\ r_2\ \ldots\ r_{n_A}]'.$$

Since these returns are not known at the time the portfolio is constructed, $v_T$ will be a random
variable to us, following an unknown distribution that depends on our choice of $w$ and the data. We
will usually rescale $v_T$ into portfolio returns $r^P = v_T/v_0 - 1$.
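A small numerical sketch may help to connect units, weights, and wealth. All prices and holdings below are made up for illustration; the code only checks that the two expressions for end-of-period wealth agree.

```r
## Sketch: units, weights, and end-of-period wealth (all numbers made up)
p0 <- c(10, 50, 200)          # prices at time 0
pT <- c(11, 48, 210)          # prices at time T
u  <- c(2000, 400, 100)       # units held

v0 <- sum(u * p0)             # initial wealth
w  <- u * p0 / v0             # weights implied by the units
r  <- pT / p0 - 1             # realized asset returns (elementwise)

vT_units   <- sum(u * pT)             # u' p_T
vT_weights <- v0 * sum(w * (1 + r))   # v_0 w'(1 + r)
rP <- vT_units / v0 - 1               # portfolio return

c(sum(w), vT_units, vT_weights, rP)
```

Note that the portfolio return equals the weighted sum of the asset returns, sum(w * r), only because the weights sum to one.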

Our computational problem will be to choose the $w$ according to a selection criterion $\Phi$. This
criterion may be a function of final wealth $v_T$ or the path $\{v_t\}_0^T$ that wealth takes over time. In the
latter case, we still have a one-period problem since we do not trade between 0 and $T$. The essential
optimization model can be written as

$$\min_{w}\ \Phi(w) \quad \text{subject to} \quad w'\iota = 1. \tag{14.1}$$

In general, $\iota$ is a vector of ones with appropriate length; in Eq. (14.1) it has length $n_A$. Typical
building blocks for $\Phi$ may be central moments such as the variance, partial or conditional moments,
and quantiles. Criteria that depend on the time path of wealth are often based on the drawdown;
but we may easily include other components such as technical indicators in $\Phi$. In principle, $\Phi$ can
be any function that can be evaluated for a sample of portfolio returns, and possibly the portfolio's
time path (which still is a sample of portfolio returns, just now these returns are sorted in time).
In fact, the time path of portfolio wealth itself may contain valuable information that is ignored by
using unsorted differences or returns.

In this chapter we will discuss computational methods for solving models such as (14.1), and 
also how to add constraints to such models. This means that we essentially skip the phase of setting 
up Model (14.1), i.e. deciding on $\Phi$, on the constraints, and on the data that we use. So before we
discuss methods to solve this model, we would like to stress the actual problem. We are at time 0,
and we do not know $\{v_t\}_0^T$ for any portfolio. We need to forecast these values. This is not just a
nuisance; it is the necessary step to make our models meaningful. If we do not get these forecasts 
right, we should not optimize at all. Portfolio optimization is not a risk-free proposition: failing 
to properly handle the data entails a cost—our portfolio will usually perform worse than simple 
and cheap portfolio rules like equal weights. In the literature, the problem is often framed as an 
estimation problem, but it is only an estimation problem if we assume some underlying process for 
our assets. Such a process may or may not be a good description of reality (or rather a description 
good enough for our purposes). 

This leads to a remark about objective functions: we need to distinguish between what we want, 
and what we should put into our model. What we want is that $v_T$ or $\{v_t\}_0^T$ have desirable properties,
such as low volatility. This does not imply, however, that we should necessarily pick a portfolio that 
also has these properties in-sample. It is worth reflecting on this idea. Let us give two examples. 

First, suppose we wish to minimize the variance of portfolio returns. Empirically, variance is
persistent; if we have two portfolios $w^{(1)}$ and $w^{(2)}$, and $w^{(1)}$ is less risky than $w^{(2)}$ in one period, then
chances are that $w^{(1)}$ is also less risky than $w^{(2)}$ in the following period. In such a case, it makes
sense to find a portfolio that "looked good" with respect to our objective in the past, and hope it
continues to do so in the future. But here is a second example. Several authors since De Bondt and
Thaler (1985) have argued that mean-reversion occurs in stock markets, hence buying past losers
and selling past winners will lead to higher returns. We could include this in our selection criterion,
building a portfolio that is maximally oversold, say, subject to constraints. But it is unlikely that
we would wish for persistence in our optimization criterion. In other words, our objective function
would include just the opposite of what we actually want.¹

The only restriction included in Model (14.1) is the budget constraint: all the weights need to 
sum to one. The solution to this model would not be very relevant in practice; we shall need more 
constraints. We may introduce minimum and maximum holding sizes for the assets included in the 
portfolios, and also cardinality constraints that set a minimum and maximum number of assets. We 
can also include other kinds of constraints like exposure limits to specific risk factors, or sector 
constraints. We may want to represent certain quantities as integers or discretize them in other 
ways: maybe because in some markets, quantities are traded in specific lot sizes;² or maybe just
because a trader simply refuses to buy or sell a nominal amount of 12,132,189.09 of government 
bonds. We can include transaction costs, or turnover constraints (measured with respect to an initial 
or benchmark portfolio). Last but not least, there are many legal constraints on what assets and how 
much of certain assets may be bought and sold. Not all these constraints will be necessary and 
meaningful for given problems; but the important point is that we could include them, and then 
after empirical testing decide whether we need them or not; but we need not exclude them because 
our computational techniques do not allow for certain features. 

In general, then, Model (14.1) cannot be solved with classical methods; hence we will use 
heuristics. But we start with a special case that can be solved. It is the classical case: mean-variance 
optimization. 


14.2 Mean-variance optimization 
14.2.1 The model 


Mean-variance analysis was introduced in Markowitz (1952) and Roy (1952). The idea is to base
$\Phi$ not on the complete distribution of portfolio returns, but only on its first two moments, return and
variance. Markowitz's selection rule states to choose a portfolio $w$ that is mean-variance efficient.
A corresponding objective function, to be minimized, can be written as

$$\Phi = (1 - \lambda)\operatorname{Var}(r^P) - \lambda m^P. \tag{14.2}$$

$m^P$ and $\operatorname{Var}(r^P)$ are the forecasts for the portfolio return $r^P$ and its variance, respectively. The
parameter $\lambda \in [0, 1]$ is a measure of risk-aversion. With $\lambda = 1$ we maximize return; with $\lambda = 0$ we
minimize variance. This objective function includes only mean and variance, but ignores the overall
shape of the return distribution. Furthermore, it only uses final wealth, not the path that wealth
takes between time 0 and $T$. The function can be computed from the forecast returns, variances, and
correlations of the single assets. Recall that $r$ is a vector of $n_A$ realized returns; hence, the return $r^P$
of a portfolio $w$ is

$$r^P = r'w.$$
For a vector

$$m = [m_1\ m_2\ \ldots\ m_{n_A}]'$$

of return forecasts, $m'w$ is the portfolio's return forecast. For the variance we have

$$\operatorname{Var}(r^P) = w'\Sigma w.$$

$\Sigma$ is our forecast of the variance-covariance matrix of the assets' returns (see also Section 14.2.5
below). Returns and variances can be computed and handled quite naturally with MATLAB® and
R; see the appendix to this chapter.

1. Theoretically, the clean approach would be to model expected returns: if high losses in the past are indicative of future
gains, expected returns should be higher, and the return scenarios that we feed into the optimization should be made to
reflect that. However, such a modeling has to be much more explicit (instead of saying "asset i will perform well," we now
need to say "asset i will have an expected return of x%") and may thus be much more prone to specification error. And
there may be operators who wish to avoid such explicit targets: trend following managers for instance will often eschew
explicit return forecasts. In any case, the methods that we describe in this chapter are flexible enough to handle models for
both approaches.

2. Exchanges have often required trades in specified multiples of contracts, so-called round lots. (Fewer contracts are called
an odd lot.) This practice is getting less common, however. In Europe, for instance, exchanges typically define a round lot
as 1 share; in Asia, notably in Japan, exchanges still require trades to be placed in multiples of round lots such as 100 or
1000.
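As a brief illustration of these two formulas, the following sketch uses simulated scenarios (the data and the weights are assumptions for illustration only). It computes the portfolio's return forecast and variance from $m$ and $\Sigma$, and checks them against the mean and variance of the portfolio's scenario returns.

```r
## Sketch: portfolio mean and variance from m and Sigma (simulated data)
set.seed(123)
ns <- 200L; na <- 5L
R <- matrix(rnorm(ns * na, mean = 0.002, sd = 0.02), nrow = ns)
w <- c(0.10, 0.20, 0.30, 0.25, 0.15)    # some weights that sum to one

m     <- colMeans(R)                    # return forecasts
Sigma <- cov(R)                         # variance-covariance matrix

mP   <- c(crossprod(m, w))              # m'w
varP <- c(t(w) %*% Sigma %*% w)         # w' Sigma w

## the same quantities, computed from the portfolio's scenario returns
rP <- c(R %*% w)
c(all.equal(mP, mean(rP)), all.equal(varP, var(rP)))
```

When the forecasts are sample means and sample covariances of the same scenario set, the two routes agree exactly; with separately forecast $m$ and $\Sigma$ only the first route is available.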


14.2.2 Solving the model 


Model (14.2), as long as we have only linear constraints, can be solved by quadratic programming
(QP). See Gill et al. (1986) for a description of the technique; we shall only discuss its
application. A QP solver computes solutions to the following model (details depend on the
particular implementation):

$$\min_{w}\ -c'w + \tfrac{1}{2} w'Qw \tag{14.3}$$

subject to the constraints

$$Aw = a, \qquad Bw \ge b.$$

Thus, we minimize a quadratic function subject to linear equality and inequality constraints. For
our model, the matrix $Q$ is symmetric and of size $n_A \times n_A$; the vector $c$ is of length $n_A$. The 1/2 in
front of $w'Qw$ is for convenience only: the derivative of $w'Qw$ with respect to $w$ is $(Q + Q')w$ or,
since $Q$ is symmetric, $2Qw$ (Petersen et al., 2008, p. 11). Hence, the 1/2 cancels out. $A$ is a matrix
with as many rows as equality constraints and $n_A$ columns. $B$ is a matrix with as many rows as
inequality constraints and $n_A$ columns. $a$ and $b$ are vectors with lengths equal to the numbers of
rows of $A$ and $B$.

QP is an iterative method. It starts with an initial guess supplied by the user and then
successively changes this portfolio until some convergence criteria are met. In a given iteration, the
program evaluates a new portfolio by computing its mean and variance. The variance-covariance
matrix $\Sigma$ (size $n_A \times n_A$) and the return forecast vector $m$ (length $n_A$) are fixed; hence, computing
time will generally increase with the number of assets.

It is the inequality constraints in Model (14.3) that require a QP solver.³ A quadratic model
with linear equality constraints can, at least in principle, be solved by computing its first derivatives
(which are linear equations), setting those equations to zero and solving them.
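To illustrate the last remark, here is a sketch for the simplest case, minimizing $w'\Sigma w$ subject only to the budget constraint $w'\iota = 1$. With a Lagrange multiplier $\gamma$ (a symbol introduced here for illustration), the first-order conditions $2\Sigma w - \gamma\iota = 0$ and $\iota'w = 1$ form a linear system that solve can handle. The data are simulated and serve only as an example.

```r
## Sketch: equality-constrained minimum variance via first-order conditions
set.seed(7)
ns <- 120L; na <- 6L
R <- matrix(rnorm(ns * na, sd = 0.015), nrow = ns)   # simulated scenarios
Sigma <- cov(R)
iota  <- rep(1, na)

## stack the conditions  2 Sigma w - gamma iota = 0  and  iota'w = 1
K   <- rbind(cbind(2 * Sigma, -iota),
             c(iota, 0))
rhs <- c(rep(0, na), 1)
sol <- solve(K, rhs)
w_foc <- sol[1:na]              # the last element of sol is gamma

## well-known closed form for comparison: Sigma^{-1} iota, rescaled
x <- solve(Sigma, iota)
w_closed <- x / sum(x)

all.equal(as.numeric(w_foc), as.numeric(w_closed))
```

Without inequality constraints the weights may well be negative; that is exactly the case in which a QP solver (or a heuristic) becomes necessary.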


14.2.3 Examples of mean-variance models 


In this section, we will discuss several classical portfolio selection models that can be formulated 
as Model (14.3). 


Minimum-variance portfolios 


We start with the simplest and—when constrained—practically most useful case of the mean- 
variance models: minimizing the variance of a portfolio. For now, we include only one restriction, 
the budget constraint. Formally, our model will be 


$$\min_{w}\ w'\Sigma w \tag{14.4}$$

subject to

$$w'\iota = 1.$$

In the formulation of Model (14.3), we have

$$Q = 2\Sigma, \quad c = [\underbrace{0, 0, 0, \ldots}_{n_A}]', \quad A = [1, 1, 1, \ldots], \quad a = 1.$$

We could also set $Q = \Sigma$, for if $w$ minimizes $w'\Sigma w$, then it also minimizes any positive constant
times $w'\Sigma w$.

An example in R; we use the function solve.QP from the quadprog package (Turlach and
Weingessel, 2007). We will use random data in this example and other examples later on; so let us
first put this computation into a function random_returns.


> random_returns <- function(na, ns, sd, mean = 0, rho = 0) {
      ## sd   = vol of returns
      ## mean = means of returns
      ##      ==> both may be scalars or vectors of length na

      ans <- rnorm(ns*na)
      dim(ans) <- c(na, ns)

      if (rho != 0) {
          C <- array(rho, dim = c(na, na))
          diag(C) <- 1
          ans <- t(chol(C)) %*% ans
      }
      ans <- ans*sd
      ans <- ans + mean
      t(ans)
  }

> ## minimum-variance portfolio with budget constraint
> library("quadprog")

> ## create random return scenarios
> R <- random_returns(na = 10, ns = 60, sd = 0.015)

> ## minimize variance
> Q <- 2 * cov(R)
> A <- rbind(rep(1, 10))
> a <- 1
> result <- solve.QP(Dmat = Q,
                     dvec = rep(0, 10),
                     Amat = t(A),
                     bvec = a,
                     meq  = 1)

> ## check budget constraint and solution
> w <- result$solution
> sum(w) ## budget constraint: should be 1
[1] 1
> all.equal(as.numeric(var(R %*% w)), result$value)
[1] TRUE

A side note: we have created the variance—covariance matrix by first simulating return scenarios 
and then computing the matrix from those returns, which is much easier than coming up with a 
random variance—covariance matrix from scratch; see also Section 14.2.5 below. 

Typically we include minimum and maximum holding sizes in the model, i.e.

$$w_j^{\min} \le w_j \le w_j^{\max} \quad \text{for all } j.$$

The vectors $w^{\min}$ and $w^{\max}$ hold lower and upper position bounds. Every asset can have specific
limits. Let $I_{n_A}$ be the identity matrix of size $n_A \times n_A$; then the constraints can be written as

$$B = \begin{bmatrix} -I_{n_A} \\ I_{n_A} \end{bmatrix} \quad \text{and} \quad b = \begin{bmatrix} -w^{\max} \\ w^{\min} \end{bmatrix}.$$

An example: with 50 assets, the matrix B can be constructed with the following commands.

In MATLAB:  [-eye(50); eye(50)]
In R:       rbind(-diag(50), diag(50))

The following MATLAB code computes the long-only minimum-variance portfolio.


Listing 14.1: C-PortfolioOptimization/M/./Ch13/minVar.m 


% minVar.m -- version 2010-12-12

% generate artificial returns data
ns = 60; % number of scenarios
na = 10; % number of assets
R = 0.005 + randn(ns, na) * 0.015;
Q = 2 * cov(R);

% set up
c = zeros(1,na);
A = ones(1,na); a = 1;
B = -eye(na); b = zeros(na,1);

% solution
w = quadprog(Q,c,B,b,A,a);

% check constraints
sum(w)
all(w>=0)


The lower limit $w^{\min}$ does not have to be zero. We could, for example, have assets selected with
weights between 2% and 5%. But this method does not allow us to specify constraints like “either 
zero weight (not in the portfolio), or a weight between 2% and 5%.” Models with such restrictions 
can be solved with heuristics, and we discuss them later. 
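Mean-variance efficient portfolios

Suppose we wish to find a portfolio that has minimal variance for a given required return $r_d$. We
need to solve the following model:

$$\min_{w}\ w'\Sigma w \tag{14.5}$$

subject to

$$m'w \ge r_d, \qquad w'\iota = 1, \qquad w_j^{\min} \le w_j \le w_j^{\max} \ \text{for all } j.$$

With a vector $m$ of return forecasts, we get the following formulation:

$$Q = 2\Sigma, \quad c = [0, 0, 0, \ldots]', \quad A = [1, 1, \ldots], \quad a = 1,$$

$$B = \begin{bmatrix} m' \\ -I_{n_A} \\ I_{n_A} \end{bmatrix}, \quad b = \begin{bmatrix} r_d \\ -w^{\max} \\ w^{\min} \end{bmatrix}.$$

If you prefer a more visual representation of the constraints:

$$\begin{bmatrix} \iota' \\ m' \\ -I_{n_A} \\ I_{n_A} \end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_{n_A} \end{bmatrix}
\ \begin{matrix} = \\ \ge \\ \ge \\ \ge \end{matrix}\
\begin{bmatrix} 1 \\ r_d \\ -w^{\max} \\ w^{\min} \end{bmatrix}$$

Note that the only difference to the minimum-variance case is the second line: the constraint on
required return.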

The following R code provides an example. The computation is also implemented in the
function mvPortfolio in the NMOF package.

> library("quadprog")

> ## create random returns
> R <- random_returns(na = 20, ns = 60,
                      sd = 0.005,
                      mean = 0.0025)
> na <- ncol(R)
> m <- colMeans(R) ## create return forecasts
> rd <- mean(m)    ## ... and required return
> wmax <- 0.1 ## maximum holding size
> wmin <- 0.0 ## minimum holding size

> ## set up matrices
> Q <- 2 * cov(R)
> A <- t(rep(1, na))
> a <- 1
> B <- rbind(t(m),
             -diag(na),
              diag(na))
> b <- rbind(rd, array(-wmax, dim = c(na, 1)),
                 array( wmin, dim = c(na, 1)))
> result <- solve.QP(Dmat = Q,
                     dvec = rep(0, na),
                     Amat = t(rbind(A, B)),
                     bvec = rbind(a, b),
                     meq  = 1)
> w <- result$solution
> sum(w) ## check budget constraint
[1] 1
> c(w %*% m >= ## check return constraint
    rd - sqrt(.Machine$double.eps))
[1] TRUE
> summary(w) ## check holding size constraint
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0111  0.0320  0.0475  0.0500  0.0649  0.0970

We crosscheck with mvPortfolio.

> all.equal(w,
            mvPortfolio(m, cov(R), min.return = rd,
                        wmin = 0, wmax = 0.1))
[1] TRUE


The tangency portfolio 


The tangency portfolio is the portfolio that maximizes the ratio of excess return to portfolio
volatility (Tobin, 1958). It is obtained from the model

$$\max_{w}\ \frac{m'w - r_f}{\sqrt{w'\Sigma w}} \tag{14.6}$$

subject to

$$w'\iota = 1 \tag{14.7}$$

and possibly other constraints. $r_f$ is the risk-free rate, assumed to be a constant. The optimization
makes sense only if there exists at least one portfolio for which $m'w$ is larger than $r_f$. Otherwise
excess return is negative and no one would want to hold risky assets.

In the following R code, we provide two further ways to compute the tangency portfolio: the
regression formulation of Britten-Jones (1999) and the computation via the first-order conditions.


> library("quadprog")

> ## create random returns
> na <- 20
> ns <- 60
> R <- random_returns(na = na, ns = ns,
                      sd = 0.05, mean = 0.005)
> m <- colMeans(R) ## means
> rf <- 0.0001     ## riskfree rate (about 2.5% pa)
> m.ex <- m - rf   ## excess means

> ## set up matrices
> Q <- cov(R) ## covariance matrix
> B <- t(m.ex)
> b <- 1
> result <- solve.QP(Dmat = Q,
                     dvec = rep(0, na),
                     Amat = t(B),
                     bvec = b,
                     meq  = 1L)

> ## rescale variables to obtain weights
> w <- result$solution/sum(result$solution)

> ## compute Sharpe ratio
> SR <- c(t(w) %*% m.ex / sqrt(t(w) %*% Q %*% w))
> sum(w) ## check budget constraint
[1] 1
> c(t(w) %*% m) > rf ## check return constraint
[1] TRUE

> ## test 1: regression approach from Britten-Jones (1999)
> R2 <- R - rf
> ones <- array(1, dim = c(ns, 1))
> solR <- lm(ones ~ -1 + R2)
> w2 <- coef(solR)
> w2 <- w2/sum(w2)
> ## ... w2 should be the same as w
> all.equal(as.numeric(w), as.numeric(w2))
[1] TRUE

> ## test 2: no inequality constraints ==> solve FOC
> w3 <- solve(Q, m.ex)
> w3 <- w3/sum(w3)
> ## ... w3 should be the same as w2 and w
> all.equal(as.numeric(w2), as.numeric(w3))
[1] TRUE


A tracking portfolio 


We may also want to minimize a function of relative returns. Suppose we know a benchmark
portfolio $w^{bm}$ and wish to minimize the variance of the difference between the benchmark returns
and the returns of our portfolio. The following formulation is possible:

$$\min_{w}\ \operatorname{Var}(r^P - r'w^{bm}) = \operatorname{Var}(r^P) + \operatorname{Var}(r'w^{bm}) - 2\operatorname{Cov}(r^P, r'w^{bm}).$$

The variance of the benchmark portfolio does not depend on the portfolio weights $w$, so we drop it
from the expression, divide by 2, and rearrange:

$$-\operatorname{Cov}(r'w^{bm}, r'w) + \tfrac{1}{2}\, w' \underbrace{\operatorname{Var}(r)}_{\Sigma}\, w.$$

For a scenario set of asset returns $R$ (size $n_S \times n_A$), the benchmark returns are $Rw^{bm}$. Hence:

$$-\underbrace{\operatorname{Cov}(Rw^{bm}, R)}_{c'}\, w + \tfrac{1}{2}\, w' \underbrace{\operatorname{Var}(R)}_{Q}\, w.$$

The $c$ term is a vector of the covariances of the benchmark portfolio with the single assets; $\operatorname{Var}(R)$
is the variance-covariance matrix of the columns of $R$.
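The mapping into Model (14.3) can be verified numerically. The sketch below uses simulated scenarios, a hypothetical equal-weight benchmark, and an arbitrary candidate portfolio (all assumptions for illustration). It builds $c$ and $Q$ and checks that the tracking-error variance equals the benchmark variance plus twice the QP objective.

```r
## Sketch: c and Q for the tracking problem (simulated, illustrative data)
set.seed(11)
ns <- 100L; na <- 4L
R   <- matrix(rnorm(ns * na, sd = 0.02), nrow = ns)
wbm <- rep(0.25, na)                # hypothetical benchmark weights
w   <- c(0.4, 0.3, 0.2, 0.1)        # some candidate portfolio

c_vec <- c(cov(R %*% wbm, R))       # covariances of benchmark with each asset
Q     <- cov(R)                     # Var(R)

## objective in the form of Model (14.3): -c'w + 1/2 w'Qw
obj <- -sum(c_vec * w) + 0.5 * c(t(w) %*% Q %*% w)

## tracking-error variance, computed directly from the scenarios
te_var <- var(c(R %*% w - R %*% wbm))

## identity: Var(Rw - Rw_bm) = Var(Rw_bm) + 2 * obj
all.equal(te_var, var(c(R %*% wbm)) + 2 * obj)
```

Since the benchmark variance is a constant, minimizing obj and minimizing the tracking-error variance lead to the same weights.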


Computing the whole frontier 


As a final example, we will trace out the complete mean-variance efficient frontier. There are three 
ways to do so. 


1. Minimize variance while varying desired return $r_d$;

2. maximize return while varying the maximally allowed variance (note that this problem cannot
be solved with QP because we have a linear objective function, but a quadratic constraint); or

3. directly work with the objective function (14.2). This third representation has two advantages.
First, we do not need to fix $r_d$ beforehand; if we had to, we would need to compute the return
of the minimum-variance portfolio first, and set $r_d$ greater than this return. Second, we need not
care what the highest possible $r_d$ is; with constraints, the maximum achievable return may not
be immediately obvious.


We recall our mean-variance problem as

$$\Phi = (1 - \lambda)\, w'\Sigma w - \lambda\, w'm.$$

We now vary $\lambda$ between 0 and 1. Note that we used the expressions for the portfolio return and
variance that explicitly rely on the portfolio weights. We set

$$Q = 2(1 - \lambda)\Sigma, \quad c = \lambda m, \quad A = [1, 1, 1, \ldots], \quad a = 1,$$

$$B = \begin{bmatrix} -I_{n_A} \\ I_{n_A} \end{bmatrix}, \quad b = \begin{bmatrix} -w^{\max} \\ w^{\min} \end{bmatrix}.$$

The following R and MATLAB programs illustrate such a computation. For R, there is a function 
mvFrontier in the NMOF package. 


library("quadprog")
library("NMOF")

R <- fundData[1:100, 1:50]


na <- dim(R) [2L] # number of assets 
m <- colMeans (R) 
Sigma <- cov(R) 

wmax <- 1.0 # maximum holding size 
wmin <- 0.0 # minimum holding size 


## compute frontier
nFP <- 100 # number of frontier points
lambdaSeq <- seq(0.001, 0.999, length = nFP)
A <- array(1, dim = c(1, na))
B <- rbind(-diag(na), diag(na))
a <- 1
b <- rbind(array(-wmax, dim = c(na, 1)),
           array( wmin, dim = c(na, 1)))

## matrix for efficient portfolios
pMat <- array(NA, dim = c(na, nFP))
rownames(pMat) <- paste("asset", 1:na)
for (lambda in lambdaSeq) {
    result <- solve.QP(Dmat = 2*(1 - lambda)*Sigma,
                       dvec = lambda*m,
                       Amat = t(rbind(A, B)),
                       bvec = rbind(a, b),
                       meq  = 1)
    pMat[, which(lambda == lambdaSeq)] <- result$solution
}

## plot results, for included assets only 
## (plot is not shown in the text) 
incl <- apply(pMat, 1, function(x) any(x > 1e-4))
do.call(par, par.portfolio) 
ocetno = C5; O25, OV) 
oe (iene = CA AGN) 
cols <- grey.colors (nrow(pMat) ) 
bp <- barplot(100*pMat, legend.text = FALSE, space = 0, 
ylab = "Weight in %", 
zdala = OT, 
              col = cols)
par(xpd = TRUE)
legend("topright",
       legend = rownames(pMat)[incl],
       col = cols[incl],
       lty = 1,
       lwd = 5,
       inset = c(-0.2, 0))
mtext ("Increasing Risk", 1) 


Listing 14.2: C-PortfolioOptimization/M//Ch13/meanVar.m 


% meanVar.m -- version 2011-05-16

%% compute a mean--variance efficient portfolio (long-only)
% generate artificial returns data
ns = 60; % number of scenarios
na = 10; % number of assets
R = 0.005 + randn(ns, na) * 0.015;
Q = 2 * cov(R);
m = mean(R);
rd = 0.0055; % required return

c = zeros(1,na);
A = ones(1,na); % equality constraints
a = 1;
B = [-m; -eye(na)]; % inequality constraints
b = [-rd zeros(1,na)]';

w = quadprog(Q,c,B,b,A,a);

% check constraints
sum(w), all(w>=0), m * w

%% compute and plot a whole frontier (long-only)
npoints = 100;
lambda = sqrt(1-linspace(0.99,0.05,npoints).^2);
B = -eye(na); b = zeros(1,na)';
for i = 1:npoints
    Q = 2*lambda(i) * cov(R);
    c = -(1-lambda(i)) * m;
    w = quadprog(Q,c,B,b,A,a);
    plot(sqrt(w'*cov(R)*w),m*w,'r.'), hold on
end
xlabel('Volatility')
ylabel('Expected portfolio return')


14.2.4 True, estimated, and realized frontiers 


Efficient frontiers should not be used as a practical quantitative model. The trouble does not come 
from the model, but from the data. We do not know the “true” expected returns, variances, and 
correlations. Not knowing a quantity is not necessarily a problem in itself; after all, statistics is all 
about inferring unknown quantities from sample data. But here the statistics enter as an ingredient 
into our decision what portfolio to hold. So, the relevant question is not “How large are our forecast 
errors?” but rather “Given a typical forecast error, what are the costs of such an error?” It turns out 
that the costs are high, and the chosen portfolios often perform miserably. For empirical evidence, 
see for instance the early results of Cohen and Pogue (1967); Frankfurter et al. (1971); and, in 
particular, Jobson and Korkie (1980); Jorion (1985, 1986); Best and Grauer (1991); Chopra et 
al. (1993); Board and Sutcliffe (1994); DeMiguel et al. (2009). Brandt (2009) gives a very good 
overview. Thus, the idea of an efficient frontier is an insightful “instrument of thought,” but not a 
quantitative decision tool. 

Unfortunately, the academic literature has mostly treated these data troubles as an “estimation” 
problem. The solution would then be to devise better estimators to infer the true parameter values 
from sample data. But that view obscures the issue—it is practically much more useful to consider 
portfolio optimization as a forecasting problem. When we estimate a quantity and then use it as an 
input to the optimization, we make an (often implicit) assumption about a model for the quantity 
of interest, or about properties of the model. For example, when we estimate expected returns by 
historical mean returns, we assume that expected returns are constant. But there is no reason to 
assume that a quantity like the expected return is constant over time or slow-moving. If we treat the 
issue really as an estimation problem, we should first make explicit our choice of the model. Very 
likely then, the problem will be model selection, not estimation. 

For the sake of the argument, suppose all the quantities of interest are constant. Even then it
would be extremely difficult to estimate them, even from a reasonably-sized sample. These difficulties
are well studied in the academic literature. Early papers go back to the 1960s and 1970s;
a main contribution came from Jobson and Korkie (1980). To demonstrate the authors’ basic idea,
(i) we fix true expected returns $m_*$ and a true variance-covariance matrix $\Sigma_*$, (ii) we simulate
data with these parameters, and (iii) we estimate parameters $m$ and $\Sigma$ from the simulated data,
and compute optimal portfolio weights $w$ corresponding to the estimates. Since we know the true
parameters, we are able to compute the truly optimal portfolio $w_*$, and we can judge how close $w$
and $w_*$ are. We compute three different frontiers (Broadie, 1993, and Windcliff and Boyle, 2004):


1. The true frontier $(1-\lambda)\, w_*'\Sigma_* w_* - \lambda\, m_*' w_*$,
2. the estimated (“hoped-for”) frontier $(1-\lambda)\, w'\Sigma w - \lambda\, m'w$, and finally
3. the realized frontier $(1-\lambda)\, w'\Sigma_* w - \lambda\, w'm_*$.


FIGURE 14.1 True frontier and estimated and realized frontiers (no short sales, $n_A$ is 25, $n_S$ is 100).


We use the data set fundData from the NMOF package. This data set consists of 500 scenarios 
of weekly returns of 200 mutual funds. These returns are not time series but were created by a 
resampling procedure described in Gilli and Schumann (2011c), and Schumann (2010, Chapter 4).
The basis for the scenarios were 200 price series of mutual funds that are registered in Germany. 
All series ran from January 2007 to December 2009; all funds were denominated in euro. The 
funds were randomly chosen from a larger database with the restriction that volatility was not 
degenerate (it had to be greater than 1% for weekly returns). A fund’s return was modeled by a 
linear model with just two factors, the German DAX 30 and the REXP index. For each fund, we 
estimated the factor equation and so obtained coefficients and a vector of residuals. The scenarios 
were then built by bootstrapping factor returns and residuals. The details of the data set do not 
matter here; it only serves to test the algorithms (as a matter of fact, we could have set up any 
arbitrary vector $m_*$ and matrix $\Sigma_*$). What matters is that, by construction, the returns in a given
column of fundData are i.i.d. 

We select the first 25 columns (assets), and 100 rows (scenarios) and compute the sample means 
and variance—covariance matrix of this subset as the true parameters. Next we create Gaussian 
random samples with these moments, and compute optimal portfolios for these samples. Finally, 
we evaluate the portfolios with respect to the true parameters. For every random draw, we obtain 
one estimated frontier and one realized frontier; the true frontier does not change. Fig. 14.1 shows 
typical results. 
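The logic of the experiment can be mimicked in a few lines of R. The following is only a sketch and not the code behind Fig. 14.1: as a simplifying assumption it uses the minimum-variance portfolio under the budget constraint alone (short sales allowed), for which a closed-form solution exists, and the "true" parameters are made up for illustration.

```r
set.seed(42)
na <- 5    ## number of assets
ns <- 60   ## number of simulated observations

## made-up "true" parameters (illustrative assumption)
vols_true  <- seq(0.02, 0.04, length.out = na)
C_true     <- array(0.5, dim = c(na, na))
diag(C_true) <- 1
Sigma_true <- outer(vols_true, vols_true) * C_true

## closed-form minimum-variance weights (budget constraint only)
minvar <- function(Sigma) {
    w <- solve(Sigma, rep(1, nrow(Sigma)))
    w / sum(w)
}

## simulate Gaussian data with the true covariance, then estimate
R <- array(rnorm(ns * na), dim = c(ns, na)) %*% chol(Sigma_true)
Sigma_hat <- cov(R)

w_true <- minvar(Sigma_true)   ## truly optimal portfolio
w_hat  <- minvar(Sigma_hat)    ## portfolio based on the estimates

var_true      <- c(w_true %*% Sigma_true %*% w_true)  ## true
var_estimated <- c(w_hat  %*% Sigma_hat  %*% w_hat)   ## "hoped-for"
var_realized  <- c(w_hat  %*% Sigma_true %*% w_hat)   ## realized

c(true = var_true, estimated = var_estimated, realized = var_realized)
```

By construction, the realized variance cannot fall below the true minimum variance; the estimated variance carries no such guarantee.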

Michaud (1989) calls this effect “error maximization.” The estimated frontiers are generally
above the true frontier; they promise more return for a given level of risk. However, we do not
get the estimated frontiers, but the realized ones. A realized frontier cannot be better than the true
frontier.[4] There are several lessons to be taken from this picture: first, a portfolio that appears
efficient in-sample is often inefficient out-of-sample; second, we should not expect efficient frontiers
to slope upwards out-of-sample. Also, even if we are relatively efficient out-of-sample, it is virtually
impossible to achieve a desired mean-variance trade-off (see also Broadie, 1993). Importantly,
this picture is still far too optimistic: we have assumed a fixed data-generating process for our data,
but this is not the case in reality.


4. This follows from the definition of the realized frontier. We can always be lucky in a single draw.

Let us make a few qualifications to these results. First, empirically, an exploitable trade-off 
likely exists across asset classes. For instance high-quality government bonds should over time 
give lower returns with lower risk. (As indeed they have over the last century; see Dimson et al., 
2002. Whether we need a very precise tool to handle this trade-off is another question.) Second, 
there is some empirical evidence for an intertemporal positive relationship between risk and return 
within single asset classes (though only over longer horizons); see, for instance, Harrison and Zhang 
(1999); and Bali et al. (2009). We only look at one-period optimization models, so we do not try to 
incorporate such a relationship. 

Note that our point here is not against portfolio optimization but we caution against efficient 
frontiers. We do not argue that there is no trade-off between risk and reward, but we are skeptical 
about the idea that we can precisely control reward and risk. Empirically, there is substantial evi- 
dence that we can control the risk of a portfolio (e.g., Chan et al., 1999, among many other studies); 
for returns, there is no such evidence. 

Altogether, portfolio optimization is a valuable tool. But we advise against simply translating
the textbook description of portfolio selection into an empirical model. In particular, be wary of
efficient frontiers. In fact, the generic portfolio selection model (14.1) did not imply we had to
compute efficient frontiers.


14.2.5 Repairing matrices 


Correlation and variance—covariance matrices are building blocks of many financial models, not 
just in portfolio optimization. In this section we discuss cases in which the variance—covariance 
matrix is not full-rank or is indefinite. 


A helpful identity 


We have a matrix $R$ of size $n_S \times n_A$ (a sample of $n_S$ observations of a random vector $r$ of length
$n_A$). Define the vector $m$ as the vector of column means of $R$, that is,

$$ m = \frac{1}{n_S} R'\iota , $$

then the identity [5]

$$ \frac{1}{n_S} R'R = \operatorname{Cov}(R) + mm' \qquad (14.8) $$

holds. The operator Cov maps the columns of the matrix $R$ into a variance-covariance matrix (we
use the maximum likelihood estimators, so we divide by $n_S$).[6] So if the mean vector is zero, the
cross product of $R$ equals the variance-covariance matrix up to a scalar. This identity shows how
$R$ and $\operatorname{Cov}(R)$ are related: essentially, the variance-covariance matrix is the cross product of $R$.
Accordingly, whenever $R$ does not have full column rank, $\operatorname{Cov}(R)$ cannot be full-rank.
The following script tests Eq. (14.8) in R.


5. This is the matrix equivalent of $E(X^2) = \operatorname{Var}(X) + (E(X))^2$.

6. The sample variance is usually estimated by dividing by $n_S - 1$ because this gives an unbiased estimator for the
variance (and we are sure not to try to compute the variance if we only have a single observation). More seriously,
dividing by $n_S$ is just as fine for several reasons:


1. It does not matter practically. For small $n_S$, the obtained values will differ numerically, but there is no telling which
value is more appropriate for the application at hand. For larger $n_S$, the values will be similar.

2. Unbiasedness is certainly a desirable property for an estimator, yet it also matters how large the bias is, and how large
the variance of an estimator is. These two properties are combined in the mean squared error of an estimator, which
is the sum of variance and bias squared. It turns out that dividing by $n_S$ instead of $n_S - 1$ leads to an estimator with a
lower mean squared error. Thus, dividing by $n_S$ results in an estimator that is actually more accurate; see Greene (2008,
Appendix C).

3. If we take the degrees-of-freedom issue really seriously, we need to consider the specific application. When we compute
a variance-covariance matrix for a portfolio optimization problem, we actually require (analytically) the inverse of this
matrix. Having an unbiased estimator for the variance matrix will not assure that the inverse is also unbiased; in fact, it
will not be. The correct degrees of freedom are then $n_S - n_A - 2$. See Brandt (2009).


> ns <- 100
> na <- 10
> R <- array(rnorm(ns * na), dim = c(ns, na))
> R1 <- crossprod(R)/ns
> R2 <- ((ns - 1)/ns) * cov(R) + outer(colMeans(R), colMeans(R))
> identical(R1, R2)
[1] FALSE
> all.equal(R1, R2)
[1] TRUE

Numerically, R1 and R2 will not be exactly equal, hence identical will give a FALSE.


Indefinite matrices 


We have $n_S$ return observations of $n_A$ assets, collected in a matrix $R$ (each series is one column).
An estimator for the variance-covariance matrix of these returns can analytically be written as

$$ \Sigma = \frac{1}{n_S} R' \underbrace{\Big( I - \frac{1}{n_S}\iota\iota' \Big)}_{M} R = \frac{1}{n_S} R'MR . \qquad (14.9) $$

The matrix $M$ transforms each column of $R$ into deviations from the respective mean. $M$ is idempotent;
hence, $R'M'MR = R'MR$. $M$ has rank $n_S - 1$; thus, if $R$ has full column rank, we need at
least $n_A + 1$ observations to obtain a full-rank variance-covariance matrix.
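The properties of the centering matrix are easy to check numerically. The following sketch uses small, arbitrary dimensions (an assumption made only for illustration) and confirms that $M$ produces deviations from column means, is idempotent, has rank $n_S - 1$, and reproduces the covariance estimator with divisor $n_S$.

```r
set.seed(1)
ns <- 6   ## observations
na <- 3   ## assets
R <- array(rnorm(ns * na), dim = c(ns, na))

iota <- rep(1, ns)
M <- diag(ns) - outer(iota, iota) / ns   ## centering matrix

## M %*% R holds deviations from the column means
dev_ok <- all.equal(M %*% R, sweep(R, 2, colMeans(R)))

## idempotent: M %*% M equals M
idem_ok <- all.equal(M %*% M, M)

## rank of M
rank_M <- qr(M)$rank

## R'MR/ns equals the covariance estimator with divisor ns
Sigma_ml <- crossprod(R, M %*% R) / ns
cov_ok <- all.equal(Sigma_ml, cov(R) * (ns - 1) / ns)

c(dev_ok, idem_ok, rank_M == ns - 1, cov_ok)
```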

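The variance-covariance matrix can be rewritten as follows:

$$
\Sigma =
\underbrace{\begin{bmatrix} \sigma_1 & & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_{n_A} \end{bmatrix}}_{D}
\underbrace{\begin{bmatrix} 1 & \rho_{1,2} & \cdots & \rho_{1,n_A} \\ \rho_{2,1} & 1 & \cdots & \rho_{2,n_A} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{n_A,1} & \rho_{n_A,2} & \cdots & 1 \end{bmatrix}}_{C}
\underbrace{\begin{bmatrix} \sigma_1 & & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_{n_A} \end{bmatrix}}_{D}
$$

where $C$ is the correlation matrix (the elements $\rho_{i,j}$ are the linear correlations), $D$ is a matrix with
the assets’ standard deviations ($\sigma_i$) as its diagonal elements, and zeros elsewhere. So we need a
correlation matrix and the volatilities of the assets.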


Example 14.1 


Fast computation with diagonal matrices. 

The product DCD can be computed faster by computing diag(D) diag(D)’ and then multiplying this
rank-one matrix element-wise with the dense matrix C. (Here, diag is an operator that creates a column
vector out of a matrix’s main diagonal.) This computation requires on the order of $n^2$ operations, while
the operation count for the full matrix product is $4n^3$. Of course, for very small matrices, the speedup
may be inconsequential. Note also that if C has a particular structure (e.g., is symmetric as in the case of
a correlation matrix), even more efficient algorithms may be applied. Example code follows. In R:


> N <- 1000  ## set size of matrix

> ## create matrices
> D <- diag(runif(N))
> C <- array(runif(N * N), dim = c(N, N))

> ## compute product / compare time
> library("rbenchmark")
> benchmark(Z1 <- D %*% C %*% D,
            Z2 <- outer(diag(D), diag(D)) * C,
            Z3 <- diag(D) %*% t(diag(D)) * C,
            order = "relative")[, 1:4]

                               test replications elapsed relative
   Z3 <- diag(D) %*% t(diag(D)) * C          100   0.448     1.00
  Z2 <- outer(diag(D), diag(D)) * C          100   0.567     1.27
                Z1 <- D %*% C %*% D          100   3.199     7.14

> ## check difference between matrices
> max(abs(Z1 - Z2))  ## ... or use all.equal(Z1, Z2)
[1] 1.11e-16
> max(abs(Z1 - Z3))  ## ... or use all.equal(Z1, Z3)
[1] 1.11e-16


> ## ... or with the Matrix package
> library("Matrix")
> D2 <- Diagonal(x = diag(D))
> benchmark(Z1 <- D %*% C %*% D,
            Z2 <- outer(diag(D), diag(D)) * C,
            Z3 <- diag(D) %*% t(diag(D)) * C,
            Z4 <- D2 %*% C %*% D2,
            Z5 <- outer(diag(D2), diag(D2)) * C,
            order = "relative")[, c(1, 3, 4)]

                                  test elapsed relative
   Z5 <- outer(diag(D2), diag(D2)) * C   0.311     1.00
      Z3 <- diag(D) %*% t(diag(D)) * C   0.466     1.50
     Z2 <- outer(diag(D), diag(D)) * C   0.540     1.74
                   Z1 <- D %*% C %*% D   3.909    12.57
                 Z4 <- D2 %*% C %*% D2   4.599    14.79

> ## check difference between matrices
> all.equal(as.numeric(Z1), as.numeric(Z4))
[1] TRUE
> all.equal(as.numeric(Z1), as.numeric(Z5))
[1] TRUE
In MATLAB:
Listing 14.3: C-PortfolioOptimization/M/./Ch13/diagonalmult.m

% diagonalmult.m -- version 2010-11-27

% set size of matrix
N = 500;

% create matrices
D = diag(rand(N, 1));
M = rand(N, N);

% compute product / compare time
tic, Z1 = D * M * D; toc
tic, Z2 = diag(D) * diag(D)' .* M; toc

% check difference between matrices
max(max(abs(Z1 - Z2)))

The diagonal matrix D is of full rank unless at least one standard deviation is zero (conditional 
on numerical precision). But this would imply that we have a degenerate asset, with constant re- 
turn; the correlation between a constant and a random variable is not defined. In any case, D will at 
least be positive-semidefinite, since the standard deviations cannot be negative. So if the variance— 
covariance matrix is not positive-semidefinite, neither is the correlation matrix C. This suggests 
that it is enough to investigate the correlation matrix. 

There are cases in which the correlation matrix is indefinite. The most obvious case occurs if we
arbitrarily change entries of the matrix (e.g., for stress tests); it can also happen if the correlations
are computed pairwise and some series have missing values. Importantly, the variance-covariance
matrix will never be indefinite if we compute it from complete time series, that is, from the product
$R'R$. What would be the problem if the matrix were indefinite? A positive-semidefinite matrix $\Sigma$
ensures that $w'\Sigma w \ge 0$ for all $w$. This implies that the variance cannot become negative (which
makes sense). But we do not have such a guarantee anymore if the matrix is indefinite.
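The pairwise case can be provoked with a tiny, deliberately extreme data set. The sketch below is constructed for illustration only: each pair of the three series is observed on a different set of rows, so that the three pairwise correlations (1, 1, and -1) cannot all hold at once.

```r
## each pair of series is observed on its own three rows
X <- cbind(x = c( 1,  2,  3, NA, NA, NA,  1,  2,  3),
           y = c( 1,  2,  3,  1,  2,  3, NA, NA, NA),
           z = c(NA, NA, NA,  1,  2,  3,  3,  2,  1))

## correlations computed pair by pair, using the rows available for each pair
C_pair <- cor(X, use = "pairwise.complete.obs")
C_pair

## at least one eigenvalue is negative, so the matrix is indefinite
ev <- eigen(C_pair, symmetric = TRUE)$values
ev
```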

An example: suppose we are to construct a portfolio of three assets, each asset having a volatility
of 0.3 (i.e. 30%). We also have the correlation matrix

$$ C = \begin{bmatrix} 1.0 & 0.9 & 0.9 \\ 0.9 & 1.0 & 0.2 \\ 0.9 & 0.2 & 1.0 \end{bmatrix}, $$

which may look innocuous but it is not. Consider the following strategy: sell short 100% of the first
asset, buy 100% of the second and the third asset, that is, select a portfolio $w = [-1, 1, 1]'$. The
portfolio’s variance is $-0.018$, the standard deviation is $0.13i$, where $i = \sqrt{-1}$.
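These numbers are quickly verified. The sketch below simply rebuilds the variance-covariance matrix from the correlation matrix and the volatilities given in the text and evaluates the portfolio.

```r
C <- matrix(c(1.0, 0.9, 0.9,
              0.9, 1.0, 0.2,
              0.9, 0.2, 1.0), nrow = 3, byrow = TRUE)
vols  <- rep(0.3, 3)
Sigma <- outer(vols, vols) * C

w <- c(-1, 1, 1)                 ## short asset 1, long assets 2 and 3
pvar <- c(w %*% Sigma %*% w)     ## portfolio "variance"
pvar

## the square root of a negative number is imaginary
sqrt(as.complex(pvar))
```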

A simple repair mechanism is based on the spectral decomposition of the correlation matrix. If 
the correlation matrix is indefinite, then at least one eigenvalue is negative. The eigenvalues of the 
matrix C can be computed with eigen (use eig (C) in MATLAB): 


> C <- matrix(c(1.0, 0.9, 0.9,
                0.9, 1.0, 0.2,
                0.9, 0.2, 1.0), nrow = 3, byrow = TRUE)
> eigen(C)$values
[1]  2.377  0.800 -0.177


So C is indefinite. To fix it, we replace all negative eigenvalues with zero. Since the resulting matrix
will not have a main diagonal of ones anymore (which is required for a correlation matrix), we need
to rescale the matrix. Algorithm 52 summarizes the procedure. For more details, see Rebonato and
Jäckel (1999).


Algorithm 52 Repair mechanism for correlation matrix.

1: compute eigenvalue decomposition $C = V \Lambda V'$
2: set $\Lambda_c = \max(\Lambda, 0)$
3: compute $C_c = V \Lambda_c V'$
4: rescale $C_c$ to obtain unit main diagonal


The matrix $V$ in Algorithm 52 stores the eigenvectors of $C$ as its columns, and $\Lambda$ is a diagonal
matrix of the eigenvalues. The scaling matrix $S$ is the diagonal matrix that holds the main diagonal of
$V\Lambda_c V'$. The rescaled correlation matrix is given by $S^{-1/2}\, V\Lambda_c V'\, S^{-1/2}$. The expression
$S^{-1/2}$ means to take the reciprocal square root of the diagonal elements of $S$ (which are nonnegative
by construction).

In MATLAB: 


Listing 14.4: C-PortfolioOptimization/M/./Ch13/repair.m 


% repair.m -- version 2010-12-10
% --compute eigenvectors/-values
[V, D] = eig(C);

% --replace negative eigenvalues by zero
D = max(D, 0);

% --reconstruct correlation matrix
CC = V * D * V';

% --rescale correlation matrix
S = 1 ./ sqrt(diag(CC));
SS = S * S';
C = CC .* SS;


For R, the computation is contained in the function repairMatrix in the NMOF package. 


> repairMatrix
function(C, eps = 0) {
    # compute eigenvectors/-values
    E <- eigen(C, symmetric = TRUE)
    V <- E$vectors
    D <- E$values

    # replace *negative* eigenvalues by eps
    D <- pmax(D, eps)

    # reconstruct correlation matrix
    BB <- V %*% diag(D) %*% t(V)

    # rescale correlation matrix
    T <- 1/sqrt(diag(BB))
    TT <- outer(T, T)
    BB * TT
}
<environment: namespace:NMOF>

Let us use the function to “repair” C.

> C
     [,1] [,2] [,3]
[1,]  1.0  0.9  0.9
[2,]  0.9  1.0  0.2
[3,]  0.9  0.2  1.0
> repairMatrix(C)
      [,1]  [,2]  [,3]
[1,] 1.000 0.785 0.785
[2,] 0.785 1.000 0.231
[3,] 0.785 0.231 1.000


Semidefinite matrices 


So far we have discussed indefinite matrices, so we have negative and positive eigenvalues. The 
same procedure is sometimes suggested to transform positive-semidefinite matrices (i.e., matrices 
with some zero eigenvalues) into positive-definite matrices (i.e., matrices with only positive eigen- 
values). That is, we can turn a rank-deficient matrix into a full-rank matrix by changing step 2 in 
Algorithm 52 into 


set $\Lambda_c = \max(\Lambda, \varepsilon)$


with $\varepsilon$ set to a very small number. (You may have noticed that the function repairMatrix
has an argument eps, which allows us to specify such a number.) This trick may allow us to run
certain numerical procedures. For example, the algorithm called by solve.QP in the quadprog
package requires a positive-definite matrix Q, which implies full rank. But: we only exchange one
“bad” matrix for another one. In other words, we may be able to work with a matrix numerically, but
this does not mean that the results need be meaningful empirically. A simple example can illustrate
this. Assume we have a linear regression model with just two regressors, $z_1$ and $z_2$. Suppose that
$z_2$ is actually $1.5 z_1$. Now obviously, we cannot estimate such a regression because we have perfect
collinearity. But would we regard this problem as solved if we replaced $z_2$ with $z_3 = z_2 + \varepsilon$?
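The regression example can be played through numerically. The sketch below uses made-up data; the sample size, the size of the perturbation, and the noise level are arbitrary assumptions. After the "repair" the regressor matrix has full rank and the regression runs, but the matrix is badly conditioned, which is a warning that the estimated coefficients should not be trusted.

```r
set.seed(7)
n  <- 50
z1 <- rnorm(n)
z2 <- 1.5 * z1                  ## perfectly collinear with z1
z3 <- z2 + 1e-4 * rnorm(n)      ## "repaired" regressor
y  <- z1 + rnorm(n, sd = 0.1)

rank_exact    <- qr(cbind(z1, z2))$rank   ## rank-deficient
rank_repaired <- qr(cbind(z1, z3))$rank   ## full rank, numerically

## condition number of the repaired regressor matrix
cond_repaired <- kappa(cbind(z1, z3), exact = TRUE)

## the regression can now be computed ...
coef(lm(y ~ z1 + z3))
## ... but the condition number signals an ill-posed problem
c(rank_exact = rank_exact, rank_repaired = rank_repaired,
  condition = cond_repaired)
```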

If a matrix R does not have full column rank, then we should call this a numerical property of 
R, but not a numerical problem in the sense of “if we could solve the numerical issues we would 
be fine.” Rank-deficiency is an empirical problem. As an example, we create a small data set R. 
Suppose these were daily-returns data of 10 equities. 


> na <- 10    ## number of assets
> nobs <- 10  ## number of observations
> R <- array(rnorm(nobs * na, sd = 0.01), dim = c(nobs, na))
> qr(cov(R))$rank
[1] 9


The rank of the covariance matrix is only 9; we would need na+1 observations to get full rank. We 
can still compute the standard deviation of a portfolio: 


> ew <- rep(1/na, na)  ## equal-weight portfolio
> c(sqrt(ew %*% cov(R) %*% ew))
[1] 0.00449

But with a non-full rank matrix and no constraints, it is guaranteed that you have portfolios like this
one:


> zerovol <- svd(cov(R))$v[, 10] 
> c(sqrt(abs(zerovol %*% cov(R) %*% zerovol) )) 


[1] 6.14e-11 


That is, a portfolio without volatility. 

Whether that matters depends on your application. With constraints, perhaps not. Fortunately, 
often it is not a “real” problem. Three examples: 

(i) Suppose the correlation matrix of R is a matrix of ones, so each pairwise correlation is 
unity. Clearly, this matrix has rank one. Is that a problem? Yes and no. Our chosen measure of 
dependence is correlation, and this correlation matrix tells us that no long-only combination of 
assets will yield any diversification benefit. So, financially, we do not get diversification. But is 
it a problem numerically? No, we just pick a portfolio that satisfies our needs best; for example, 
we pick the asset with the highest Sharpe ratio. (Needless to say, this is a contrived example. In 
practice, diversification should not be driven by price data alone, as risks may not be reflected in 
price movements. Just think of fraud, for instance, or defaults in general.) 

(ii) Suppose R does not have full column rank. Then there are linear combinations of the 
columns (i.e., portfolios) that behave exactly like other linear combinations (i.e., other portfolios). 
Potentially, then, our chosen solution is not unique. Such linear combinations are not always feasi- 
ble (e.g., they may require short sales), so once we include constraints, our solution may become 
unique. But suppose that even with restrictions we still do not get a unique solution. Then, ac- 
cording to our model, we cannot differentiate between different solutions. Why should that be a 
numerical problem? We could either pick a portfolio randomly, or (more likely) think about how 
we should extend our model. 

It is also possible that R has full column rank, but R’R does not, as a result of finite-precision 
arithmetic. This is only very rarely a concern for finance problems when we work in double preci- 
sion, and all the points just made still apply. 

(iii) If we have fewer observations than assets, a correlation matrix will not have full rank. But 
again, this is an empirical problem; the matrix will still be nonnegative-definite. Not all implemen- 
tations will accept such a matrix (a heuristic method could always be used; see the next section). 
Yet the main problem is that we try to estimate many quantities from too few data points. 


In any case, besides the numerical trick of repairing a matrix described above, there are other
techniques available that help to overcome semidefiniteness, and which may even help on the
empirical side: factor models and shrinkage. In a linear factor model, we represent the returns of asset $i$
by an equation like the following:

$$ r^{(i)} = \beta_0^{(i)} + \beta^{(i)\prime} f + \epsilon^{(i)} , \qquad (14.10) $$

in which $\beta_0^{(i)}$ is a constant, $f$ stands for the returns of the $n_F$ factors, and $\epsilon^{(i)}$ is an unexplained
error. Once we have estimated such a model for every asset, we collect the loadings $\beta^{(i)}$ in a matrix
$B$ of size $n_A \times n_F$. The variance-covariance matrix of the assets is then given by

$$ B \operatorname{Cov}(f) B' + D . \qquad (14.11) $$


The expression $\operatorname{Cov}(f)$ stands for the variance-covariance matrix of the factor returns, and $D$ is a
diagonal matrix with the assets’ idiosyncratic variances on its main diagonal. This latter matrix is,
by construction, of full rank; thus, the overall matrix will be of full rank. As an example, suppose
we use only a single factor to model returns, which is not unusual when it comes to stocks. Then
the first part of Eq. (14.11) will simplify to the outer product of the vector of $\beta$ estimates of the
assets, times the variance of the single factor (a scalar), which is a matrix of rank one. However, $D$
will have full rank, and adding both matrices results in a full rank matrix overall.

Shrinkage is similar. Shrinkage essentially means that we replace the standard estimator for
the variance-covariance matrix with a linear (convex) combination of the standard estimator and
a more structured estimate. The latter may be a simple diagonal matrix, for instance; or it may be
itself the variance-covariance matrix derived from a usually-simple factor model.

It is important to note that factor models and shrinkage estimators cannot conjure up information 
that the standard estimator could not see somehow. Rather, full rank is imposed by the Analyst. For 
factor models, the assumption is that all covariance can be attributed to common factors, and so D 
is diagonal. In the case of shrinkage, we assume that the structured estimator that we combine with 
the standard estimator is an appropriate (and full rank) one. 
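Both techniques can be sketched in a few lines. The example below uses simulated data with fewer observations than assets, so the standard estimator is rank-deficient. The single-factor data-generating process, the loadings, the diagonal shrinkage target, and in particular the shrinkage weight `delta` are arbitrary assumptions made for illustration; they are not recommendations, and no attempt is made to choose the weight optimally.

```r
set.seed(3)
na <- 50; nobs <- 20                      ## fewer observations than assets
f    <- rnorm(nobs, sd = 0.01)            ## returns of a single factor
beta <- runif(na, min = 0.5, max = 1.5)   ## made-up loadings
R    <- outer(f, beta) +
        array(rnorm(nobs * na, sd = 0.01), dim = c(nobs, na))

S      <- cov(R)                          ## standard estimator
rank_S <- qr(S)$rank                      ## cannot exceed nobs - 1

## single-factor model: B Cov(f) B' + D
b_hat   <- c(cov(R, f)) / var(f)          ## OLS slope for every asset
resid   <- R - outer(f, b_hat)            ## residuals, up to a constant per asset
D       <- diag(apply(resid, 2, var))     ## idiosyncratic variances
Sigma_f <- var(f) * outer(b_hat, b_hat) + D

## shrinkage towards a diagonal target with an arbitrary weight
delta   <- 0.5
Sigma_s <- delta * diag(diag(S)) + (1 - delta) * S

## smallest eigenvalue of each estimate
min_eig <- function(A)
    min(eigen(A, symmetric = TRUE, only.values = TRUE)$values)
c(sample = min_eig(S), factor = min_eig(Sigma_f), shrinkage = min_eig(Sigma_s))
```

The smallest eigenvalue of the standard estimator is zero up to rounding, while those of the two structured estimates are strictly positive; full rank comes from the imposed structure, not from the data.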

Finally, another approach is to accept the rank-deficiency and its implicit reduction in
dimensionality. Suppose we wish to create random samples of Gaussian variates that follow a specific
variance-covariance matrix $\Sigma$. The canonical way to do that is to


1. create a matrix X of standard Gaussian variates without any specific correlation (each row holds 
one observation), 

2. compute the Cholesky factor of the desired variance—covariance matrix, and 

3. post-multiply X by the Cholesky factor. 


If $\Sigma$ is rank-deficient, we can use a factorization of $\Sigma$ that does not require full rank, such as the
eigenvalue decomposition (see Section 7.1.1). The downside of an alternative decomposition is that
it might be expensive. Also, since we want to sample, we would create more random numbers than
we actually need.
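A minimal sketch of sampling via the eigenvalue decomposition follows. The rank-deficient matrix is made up for illustration: the third asset is an exact copy of the first. Note that we draw three standard Gaussian variates per observation although two would suffice, which is the waste mentioned above.

```r
set.seed(11)
## rank-deficient covariance matrix: asset 3 duplicates asset 1
B     <- cbind(c(1, 0, 1), c(0, 1, 0))
Sigma <- B %*% t(B) * 0.01

## factor Sigma as A %*% t(A) without requiring full rank
E   <- eigen(Sigma, symmetric = TRUE)
lam <- pmax(E$values, 0)                ## guard against tiny negative values
A   <- E$vectors %*% diag(sqrt(lam))

## uncorrelated standard Gaussian variates, one observation per row
n <- 1000
Z <- array(rnorm(n * ncol(A)), dim = c(n, ncol(A)))

## impose the covariance structure
X <- Z %*% t(A)

all.equal(A %*% t(A), Sigma)    ## the factorization reproduces Sigma
max(abs(X[, 1] - X[, 3]))       ## columns 1 and 3 coincide up to rounding
```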

Let us provide a numeric example. We call our original data matrix M. 


> nc <- 3  ## columns
## rows
= inte <= A 


> M <- array(rnorm(nr*nc), dim = c(nr, nc))
> C <- array(0.5, dim = c(nc, nc))
> diag(C) <- 1
> M <- M %*% chol(C)
> M <- M[, c(1, 1, 1, 2, 3)]
> cor(M)

      [,1]  [,2]  [,3]  [,4]  [,5]
[1,] 1.000 1.000 1.000 0.232 0.279
[2,] 1.000 1.000 1.000 0.232 0.279
[3,] 1.000 1.000 1.000 0.232 0.279
[4,] 0.232 0.232 0.232 1.000 0.447
[5,] 0.279 0.279 0.279 0.447 1.000


M is rank-deficient by construction, since columns 2 and 3 are copies of column 1. 


> head(M, 3)

       [,1]   [,2]   [,3]    [,4]    [,5]
[1,] -0.235 -0.235 -0.235  1.2259 -0.5438
[2,]  1.206  1.206  1.206 -0.355  -0.0165
[3,] -0.658 -0.658 -0.658  0.871   1.8449


If we were to compute a correlation matrix from these columns, we would only need a three by 
three matrix. The function colSubset will do all the work; the function is included in the NMOF 
package. 


> colSubset
function(x) {
    nr <- dim(x)[1L]
    nc <- dim(x)[2L]
    QR <- qr(x, LAPACK = FALSE)

    if ((qrank <- QR$rank) == nc) {
        list(columns = seq_len(nc),
             multiplier = diag(nc))
    } else {
        cols  <- QR$pivot[seq_len(qrank)]
        cols_ <- QR$pivot[(qrank + 1L):nc]
        D <- array(0, dim = c(qrank, nc))
        D[cbind(seq_along(cols), cols)] <- 1
        D[ , cols_] <- qr.solve(x[ , cols], x[ , cols_])
        list(columns = cols,
             multiplier = D)
    }
}
<environment: namespace:NMOF>
We call the function on M. 


> colSubset(M)
$columns
[1] 1 4 5

$multiplier
     [,1]      [,2]      [,3] [,4] [,5]
[1,]    1  1.00e+00  1.00e+00    0    0
[2,]    0  8.82e-17  8.82e-17    1    0
[3,]    0 -3.25e-17 -3.25e-17    0    1


It returns a list of two components: a full-rank subset named columns that spans the column space 
of the original matrix, and a matrix multiplier. 

To create random numbers that match the correlation structure of M, we create random numbers 
as in the textbook version, but we use a correlation matrix that uses only the full-rank column 
subset. Then we postmultiply by multiplier. 


> css <- colSubset(M)
> C <- cor(M[, css$columns])
> nc <- ncol(C)

= inne <= OONO 

> X <- array(rnorm(nr * nc), dim = c(nr, nc))
> X <- X %*% chol(C)
> X <- X %*% css$multiplier
> cor(X)

      [,1]  [,2]  [,3]  [,4]  [,5]
[1,] 1.000 1.000 1.000 0.244 0.256
[2,] 1.000 1.000 1.000 0.244 0.256
[3,] 1.000 1.000 1.000 0.244 0.256
[4,] 0.244 0.244 0.244 1.000 0.452
[5,] 0.256 0.256 0.256 0.452 1.000

To summarize: when we encounter numerical difficulties with variance-covariance matrices,
then our problem is almost always empirical. This is in particular true if the difficulties result 
from finite-precision arithmetic. In other words, when our computer finally complains, we have 
probably gone too far, anyway. On modern computers, we can still compute solutions to problems 
when—empirically—we should do so no more. See also the regression/conditioning example on 
page 27. 


14.3 Optimization with heuristics 


In all the examples in the last section, we were constrained when we set up the model. The objec- 
tive could only be linear or quadratic; the constraints had to be linear. To be sure, there are other 
variants of mathematical programming, but they all put restrictions on the functional form of the 
objective and the constraints. But what if we wanted to include a straightforward constraint like 
“every pairwise correlation should be below 0.7”? Or think of the simple asset selection problem 
we introduced in Chapter 12. We were to pick K out of na assets such that these assets optimize 
a given criterion. Such models can still be solved numerically, but we need other methods. In this 
section, we now turn to such alternative techniques: heuristics. 


14.3.1 Asset selection with Local Search 


The problem 
We will first go back to the problem introduced in Section 12.6.2. 


Aim: select between $K_{\min}$ and $K_{\max}$ out of $n_A$ assets such that an equally weighted portfolio has
the lowest-possible variance.


The formal model is:

$$ \min_w \; w'\Sigma w \qquad (14.12) $$

subject to the constraints

$$ w_j = 1/K \quad \text{for } j \in J , \qquad K_{\min} \le K \le K_{\max} . \qquad (14.13) $$

The symbol $J$ stands for the set of assets in the portfolio, and $K = \#\{J\}$ is the cardinality of this
set, i.e. the number of assets in the portfolio.

A simple technique for this problem was described in Section 12.6.2: Local Search. We start 
with a random portfolio and evaluate it. Then we iteratively change (or perturb) this portfolio until 
some criterion tells us to stop. The strategy is summarized in Algorithm 53. Within the algorithm, 
we use the symbol x for a solution. (Within the actual code, solutions are also usually denoted x.) 
The function LSopt, contained in the NMOF package, implements a Local Search. A shortened 
version is given in the function LSopt., in which the final "." serves as a reminder that the 
function is abbreviated. 


> LSopt. <- function(OF, algo = list(), ...) {
      xc <- algo$x0
      xcF <- OF(xc, ...)
      for (s in seq_len(algo$nS)) {
          xn <- algo$neighbour(xc, ...)
          xnF <- OF(xn, ...)
          if (xnF <= xcF) {
              xc <- xn
              xcF <- xnF
          }
      }
      list(xbest = xc, OFvalue = xcF)
  }


Algorithm 53 Local Search for asset selection.

1: set $n_{\text{steps}}$
2: randomly generate current portfolio $x^c$
3: for $i = 1 : n_{\text{steps}}$ do
4:     generate $x^n \in \mathcal{N}(x^c)$ and compute $\Delta = \Phi(x^n) - \Phi(x^c)$
5:     if $\Delta < 0$ then $x^c = x^n$
6: end for
7: $x^{\text{sol}} = x^c$


To run Local Search we need 


• a method for evaluating a portfolio (the function $\Phi$ in Algorithm 53);

• a method for changing a given portfolio (the function $\mathcal{N}$ in Algorithm 53);

• a rule that tells us whether to accept a new solution;

• finally, we need a rule that tells us when to stop our search.
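The four ingredients can be seen together in a small, self-contained sketch. Everything in it is an assumption made for illustration: the toy data, the bounds on the cardinality, the number of steps, and in particular the neighbourhood function, which is a hypothetical one that flips a single randomly chosen element of the solution while respecting the cardinality bounds. It does not call LSopt and is not the NMOF implementation.

```r
set.seed(99)
na <- 20L; Kmin <- 3L; Kmax <- 6L

## toy data: constant correlation, random volatilities
vols <- runif(na, min = 0.2, max = 0.4)
C <- array(0.6, dim = c(na, na))
diag(C) <- 1
Sigma <- outer(vols, vols) * C

## evaluate: variance of the equal-weight portfolio of the selected assets
OF <- function(x) {
    K <- sum(x)
    w <- rep(1 / K, K)
    c(w %*% Sigma[x, x] %*% w)
}

## change: flip one element, keeping Kmin <= K <= Kmax
## (hypothetical neighbourhood, for illustration only)
neighbour <- function(x) {
    K <- sum(x)
    if (K == Kmax) {
        i <- which(x)          ## may only remove an asset
    } else if (K == Kmin) {
        i <- which(!x)         ## may only add an asset
    } else {
        i <- seq_along(x)      ## may add or remove
    }
    j <- i[sample.int(length(i), 1L)]
    x[j] <- !x[j]
    x
}

## random initial solution with four assets
x0 <- logical(na)
x0[sample.int(na, 4L)] <- TRUE

xc  <- x0
xcF <- OF(xc)
for (s in seq_len(500L)) {     ## stop: fixed number of steps
    xn  <- neighbour(xc)
    xnF <- OF(xn)
    if (xnF <= xcF) {          ## accept: new solution is not worse
        xc  <- xn
        xcF <- xnF
    }
}
c(initial = OF(x0), final = xcF, K = sum(xc))
```

Because a new solution is only accepted if it is not worse, the final objective function value can never exceed the initial one.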


We will discuss these ingredients in more detail below; but first let us describe how these
ingredients are passed to LSopt.
LSopt is called with three arguments.


> LSopt(OF, algo = list(), ...)


The ... enables us to pass other objects into the algorithm; all such objects will be passed on 
to the objective function and the neighborhood function. In this chapter, we will always collect such 
additional variables in a list Data, hence we call LSopt with arguments OF, algo, and Data. 


OF is the objective function. 

algo is a list that holds the settings of the local search. There are only a few: an initial solution x0,
a neighborhood definition, and the number of iterations (or steps, hence nS).

Data is a list that holds the pieces of data necessary to evaluate the objective function. 


In each step, the algorithm calls the neighborhood function, and evaluates the new solution through 
a call of OF. In the code, we use the notation xc for the current solution, and xn for the neighbor 
solution. LSopt returns a list with several components, among them: 


xbest is the solution. 

OFvalue is the objective function value associated with xbest. 

Fmat is a matrix with two columns. The first column contains the solution quality of the new 
solution for every step. The second contains the value of the current (i.e., accepted) solution. 
This must also be the best solution overall. 


Now let us go through the given problem step by step. 


Coding and evaluating a portfolio 


Conceptually, a portfolio is a vector. Since we look for an equal-weight portfolio, a solution x can 
be coded as a logical vector: a TRUE means the asset is included. 

Let us start by creating some random data. We set a number of assets, construct a correlation
matrix with constant pairwise correlation (see function const_cor below), and draw volatilities
for our assets from the range 20 to 40 percent. We obtain Sigma, the variance-covariance matrix
for na = 500 assets. We want to select a portfolio of between 30 and 60 assets. We collect all
information in the list Data.


= Comnet Cor <= PTni, m) 1 
@ z= ariaya, Ckm = Clim, a) 
diag(C) <- 1 


max = 


C 
} 
> na <- 500L ## create random data 
= C <= eont Cor En DA) 
> vols <= umie (ine, ma = OA, 
> Sigma <- outer(vols, vols) * C 
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0.4) 


As we said above, all pieces of information are collected in a single list.


> Data <- list(Sigma = Sigma,
               Kmin = 30L,
               Kmax = 60L,
               na = na,
               n.changes = 1L)


Let us create a random solution: we first sample a random cardinality, and then select assets. We
use the R function sample. Note that we hedge for the case that Data$Kmin is not smaller than
Data$Kmax: when given a single number k, sample(k, 1L) draws from 1:k instead of returning k.


> if (Data$Kmin == Data$Kmax)
      n.assets <- Data$Kmin else
      n.assets <- sample(Data$Kmin:Data$Kmax, 1L)
> x0 <- logical(na)
> assets <- sort(sample(na, n.assets))
> x0[assets] <- TRUE
> table(x0)


x0
FALSE  TRUE
  445    55


Given such a solution, we obtain weights by dividing each element of the vector by sum(x),
which will automatically coerce the logicals to numeric. Such a portfolio w is evaluated as stated
in Model (14.12), that is, given the variance-covariance matrix $\Sigma$ and $w$, we compute the portfolio
variance $w'\Sigma w$. We put the computation into a function OF.


> OF <- function(x, Data) {
      w <- x[x]/sum(x)
      c(w %*% Data$Sigma[x, x] %*% w)
  }


Actually, let us test two alternatives.


> OF_alternative <- function(x, Data) {
      w <- x[x]/sum(x)
      res <- crossprod(w, Data$Sigma[x, x])
      c(tcrossprod(w, res))
  }
> all.equal(OF(x0, Data), OF_alternative(x0, Data))


[1] TRUE


> library("rbenchmark")
> benchmark(OF(x0, Data),
            OF_alternative(x0, Data),
            replications = 10000,
            order = "relative")[, c(1, 3, 4)]


                      test elapsed relative
1             OF(x0, Data)   0.147     1.00
2 OF_alternative(x0, Data)   0.298     2.03


We call the objective function with two arguments: a solution, and the list Data. The list Data 
contains the objects we need to evaluate a solution; in the example, this is just Sigma. 

Depending on the version of R you use (in particular, the linear-algebra library it uses), the 
functions crossprod and tcrossprod may be more efficient than the %*% operator. Note also 
that we do not use the full vector w, but only those elements that are greater than zero. 

An aside: for this special case, an even simpler objective function, which only works for equal 
weights, is given in the following code: 


> OF_simple <- function(x, Data) { 
w <- 1/sum(x) 
sum(w * w * Data$Sigma[x, x]) 
} 
> all.equal(OF(x0, Data), OF_simple(x0, Data)) 


[1] TRUE


> benchmark(OF(x0, Data),
            OF_alternative(x0, Data),
            OF_simple(x0, Data),
            replications = 10000,
            order = "relative")[, c(1, 3, 4)]


                      test elapsed relative
1             OF(x0, Data)   0.125     1.00
3      OF_simple(x0, Data)   0.149     1.19
2 OF_alternative(x0, Data)   0.286     2.29
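To see why OF_simple agrees with OF for equal weights, it may help to write out the quadratic form. The following short derivation is only a sketch; the symbols $J$ (the set of included assets) and $K$ (its size) are introduced here for illustration and do not appear in the code.

```latex
w'\Sigma w
  = \sum_{i \in J}\sum_{j \in J} w_i\, w_j\, \sigma_{ij}
  = \frac{1}{K^2}\sum_{i \in J}\sum_{j \in J} \sigma_{ij},
\qquad w_i = \frac{1}{K} \ \text{for } i \in J, \quad K = |J| .
```

The right-hand side is what sum(w * w * Data$Sigma[x, x]) computes when w is the scalar 1/sum(x): the sum of all elements of the selected block of Sigma, scaled by $1/K^2$.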


Changing a portfolio: the neighborhood 


Now we come to the most important part of Local Search, the neighborhood $\mathcal{N}$. The neighborhood
defines what a solution "close to" another solution means. Formally, the neighborhood of a given
solution is a subset of the solution space such that a given distance criterion is met. In the extreme,
the neighborhood of a portfolio could contain all other possible portfolios, but then we would
actually do a random search because from any portfolio we could step to any other feasible portfolio. In
general we will choose the neighborhood such that the objective function exhibits "local behavior."
This means that $\Phi(\mathcal{N}(x))$ should, at least on average, be close to $\Phi(x)$; e.g., when compared with
randomly chosen solutions.

For our problem, as a measure for the distance between two portfolios, we can use the number
of different positions, the so-called Hamming distance. A portfolio is a vector x of length $n_A$,
so if x[j] is TRUE, then asset j is in the portfolio; if x[j] is FALSE, it is not included. A
neighborhood can be defined by the number of different bits.
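As a small illustration of this distance measure, the following sketch computes the Hamming distance between a portfolio and a neighbor obtained by flipping two randomly chosen positions. The helper name hamming and the toy sizes are assumptions made for this example only.

```r
## hypothetical helper: number of positions in which two
## logical portfolio vectors differ
hamming <- function(x, y) sum(x != y)

set.seed(42)
na <- 10L                         ## toy number of assets
xc <- logical(na)
xc[c(2, 5, 7)] <- TRUE            ## current portfolio: assets 2, 5, 7

xn <- xc
p <- sample.int(na, 2L)           ## two distinct positions
xn[p] <- !xn[p]                   ## flip them

d <- hamming(xc, xn)              ## equals the number of flipped bits
```

Because the two positions are drawn without replacement, the distance between xc and xn equals the number of changed bits.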

When we speak of a neighborhood function when it comes to the code, this will always mean 
a mapping from a given portfolio to one specific other portfolio. That is, the function will take one 
solution, copy it, randomly modify the copy, and return the modified copy. So in every iteration the 
neighborhood function is called with the current solution xc and returns one neighbor solution xn. 

With our encoding it is straightforward to generate neighbors: randomly pick a j, and change
x[j]. This is summarized in Algorithm 54; R code is in the function neighbour.

> neighbour <- function(xc, Data) {
      xn <- xc
      p <- sample.int(Data$na, Data$n.changes,
                      replace = FALSE)
      xn[p] <- !xn[p]

      ## reject infeasible solution
      if ((sum(xn) > Data$Kmax) ||
          (sum(xn) < Data$Kmin))
          xc
      else
          xn
  }


Algorithm 54 Neighborhood (asset selection).

1: randomly select $j \in \{1, \ldots, n_A\}$
2: if $x_j$ = TRUE then
3:     set $x_j$ = FALSE
4: else
5:     set $x_j$ = TRUE
6: end if


We pass a variable n.changes with Data, which gives the number of bits that are changed.
Note also that we handle the constraints in this neighborhood function: if the new solution violates
a constraint, we reject it, and return the old solution.

So, the preparations are done already; we have


• a method for evaluating a portfolio: the function OF, and

• a method for changing a portfolio: the function neighbour.

• We need an acceptance rule: in Local Search we accept a new solution if it is better (at least not
worse) than the previous solution.

• We stop the search after a predefined number of steps nS.
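The four ingredients can be put together in a few lines. The following is a minimal sketch of such a loop on a small toy problem; the function local_search is a hypothetical stand-in written for this illustration and is not the implementation of LSopt in NMOF. The toy sizes (20 assets, 3 to 8 holdings) and the correlation of 0.5 are assumptions.

```r
## toy data (illustrative values only)
set.seed(1)
na <- 20L
vols <- runif(na, min = 0.2, max = 0.4)
C <- array(0.5, dim = c(na, na)); diag(C) <- 1
Data <- list(Sigma = outer(vols, vols) * C,
             Kmin = 3L, Kmax = 8L, na = na, n.changes = 1L)

## ingredient 1: evaluate a portfolio (variance, equal weights)
OF <- function(x, Data) {
    w <- x[x] / sum(x)
    c(w %*% Data$Sigma[x, x] %*% w)
}

## ingredient 2: change a portfolio (flip bits, reject if infeasible)
neighbour <- function(xc, Data) {
    xn <- xc
    p <- sample.int(Data$na, Data$n.changes)
    xn[p] <- !xn[p]
    if (sum(xn) > Data$Kmax || sum(xn) < Data$Kmin) xc else xn
}

## hypothetical helper, not NMOF's LSopt
local_search <- function(OF, x0, neighbour, nS, Data) {
    xc <- x0
    xcF <- OF(xc, Data)
    Fmat <- array(NA_real_, dim = c(nS, 2L))
    for (s in seq_len(nS)) {          ## ingredient 4: stop after nS steps
        xn <- neighbour(xc, Data)
        xnF <- OF(xn, Data)
        if (xnF <= xcF) {             ## ingredient 3: accept if not worse
            xc <- xn
            xcF <- xnF
        }
        Fmat[s, ] <- c(xnF, xcF)      ## new and current objective values
    }
    list(xbest = xc, OFvalue = xcF, Fmat = Fmat)
}

x0 <- logical(na)
x0[sample.int(na, 5L)] <- TRUE
sol <- local_search(OF, x0, neighbour, nS = 2000L, Data = Data)
```

Since a new solution is only accepted when it is not worse, the second column of Fmat can never increase, which is why the current solution is always also the best one.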


Running Local Search 


We run the example, for which we reuse the data we created before. Recall that we collected all 
information in the list Data. 


> str (Data) 


List of 5
 $ Sigma    : num [1:500, 1:500] 0.122173 0.0597 0.0771...
 $ Kmin     : int 30
 $ Kmax     : int 60
 $ na       : int 500
 $ n.changes: int 1


All settings (the initial solution x0, the neighbour function, and the number of steps nS) go into
the list algo.


> algo <- list(x0 = x0,
               neighbour = neighbour,
               nS = 30000L,
               printBar = FALSE)


Finally, we can run the algorithm.


> set.seed(48957)
> sol1 <- LSopt(OF, algo, Data)  ## the function in NMOF


Local Search.

Initial solution:  0.0515
Finished.

Best solution overall: 0.0264


> set.seed(48957)
> sol2 <- LSopt.(OF, algo, Data)  ## the abbreviated function
> all.equal(sol1$xbest, sol2$xbest)


[1] TRUE


Let us look at the results.


> sqrt(sol1$OFvalue)


[1] 0.163
We plot how the objective function has evolved over the course of the optimization.


> do.call(par, par.portfolio)
> par(mar = c(3, 5, 1, 1))
> plot(sqrt(sol1$Fmat[, 2L]), type = "l",
       ylab = "Portfolio volatility", xlab = "Iteration")

Let us restart the algorithm several times. With 30,000 steps and 500 assets, one run takes less
than a second. We let the algorithm run not just once, but 100 times. We can use the function
restartOpt for that. The function restartOpt. is an abbreviated version (see footnote 7).


> restartOpt. <- function(fun, n, OF, algo = NULL, ...) {
      n <- as.integer(n)
      stopifnot(n > 0L)
      allResults <- vector("list", n)
      for (i in seq_len(n))
          allResults[[i]] <- fun(OF, algo, ...)
      allResults
  }


The function restartOpt takes an optimization algorithm fun (here, LSopt), a number n of 
restarts, and the arguments for fun. It is a simple wrapper for a loop in which fun is called n 
times (except that restartOpt may also run the restarts in parallel). The function returns a list 
allResults. With this, we can reproduce the results presented in the previous chapter. 


7. In fact, the original version in the NMOF package looked pretty much exactly like restartOpt.; see
https://r-forge.r-project.org/scm/viewvc.php/pkg/NMOF/R/restartOpt.R?view=markup&revision=2&root=nmof.

> algo$printDetail <- FALSE
> trials <- 100L
> allRes <- restartOpt(LSopt, n = trials,
                       OF = OF, algo = algo, Data = Data)
> allResOF <- numeric(trials)
> for (i in seq_len(trials))
      allResOF[i] <- sqrt(allRes[[i]]$OFvalue)


Local Search is simple, but it is already powerful enough to produce good solutions to the
problem here. A disadvantage is that it might get stuck in local minima. The following figure
shows the objective function values of several restarts over 30,000 steps. (These values are stored
in the matrix Fmat that is returned when LSopt is called.) If we were really concerned that the
solutions do not converge on a single point, we could use the method we outline
next; or we could use several restarts. Practically, for this case, this does not seem necessary. But
note that Local Search alone will not always work so well.
In the coming sections, we shall explore examples for which more powerful methods are needed.


14.3.2 Scenario Optimization with Threshold Accepting 


In this section, we discuss generic portfolio optimization models with Threshold Accepting (TA). 
TA was introduced in Dueck and Scheuer (1990) and Moscato and Fontanari (1990). It was in- 
cidentally the first heuristic applied to portfolio selection problems (Dueck and Winker, 1992); 
see also Gilli and Schumann (2010e). We start by describing the general approach—scenario 
optimization—and then move to several examples. 


Ingredient 1: handling scenarios 


Recall that in the standard mean-variance model, we work with the properties of single assets
(means, variances, and correlations), independently of specific portfolio weights. All information
is condensed into an expected return vector of length $n_A$, and a variance-covariance matrix of size
$n_A \times n_A$. Computing time will increase with the number of selectable assets, but not with the amount
of data that goes into the computation of the means, variances, etc. Importantly, we capture the
relevant information about the distribution of assets independently of the specific portfolio chosen.
This is highly convenient (it is the reason for the success of mean-variance optimization), but it
rarely generalizes to other specifications, or only in ways that are cumbersome and computationally
very expensive. See, for instance, Jondeau et al., 2007, Chapter 9, for how to set up skewness or
kurtosis "matrices." In fact, for skewness we actually have a cube, that is, a three-dimensional
array; for kurtosis we have a four-dimensional array. See also Konno et al. (1993), Konno and Suzuki
(1995).

Fortunately for us, there is a much more flexible approach: scenario optimization. For scenario
optimization, we assume that we can obtain a return sample $r^p = [r^p_1 \; r^p_2 \; \ldots \; r^p_{n_S}]'$, that is, a sample
for a specific portfolio. Given such a sample we can easily evaluate any objective function; for
instance, compute partial moments, and, hence, evaluate the portfolio. In essence, scenario
optimization is equivalent to working with the empirical distribution function of the portfolio returns,
but this is not restrictive: if we preferred a parametric approach, we would estimate the necessary
parameters from the scenarios.

Let us go through the idea in a bit more detail. Suppose we have $n_A$ assets, and $n_S$ return
scenarios. For now, think that the latter are simply historical returns. We collect all data in a matrix $R$
of size $n_S \times n_A$. One row of $R$ thus stores the returns for one state of nature. We can equivalently
work with price scenarios, computed as


$$P = (\mathbf{1} + R)\,\operatorname{diag}(p_0)$$


where $\mathbf{1}$ is a matrix of ones of size $n_S \times n_A$, $\operatorname{diag}(\cdot)$ is an operator that transforms a vector into a
diagonal matrix, and $p_0$ is the vector of initial prices. Note that the columns of $P$ are not time series.
Rather, every row of $P$ holds the prices for one possible future scenario that might occur, given
initial prices of $p_0$. In fact, for many objective functions (e.g., moments) it is not relevant whether
the scenarios are sorted in time, since such criteria only capture the cross-section of returns. A
portfolio is either described by $u$ (the holdings), or $w$ (the weights). The portfolio values under the
scenarios can then be obtained by $v = Pu$; equivalently, we can compute returns $r^p = Rw$.
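The relation between price scenarios and return scenarios can be checked numerically. The following sketch uses simulated returns and made-up prices and holdings; all numbers, and the variable names nS, nA, v0, and rp, are assumptions for illustration.

```r
set.seed(123)
nS <- 250L                           ## number of scenarios
nA <- 4L                             ## number of assets
R <- array(rnorm(nS * nA, mean = 0.0005, sd = 0.01),
           dim = c(nS, nA))          ## return scenarios, nS x nA

p0 <- c(100, 50, 20, 10)             ## initial prices (made up)
P <- (1 + R) %*% diag(p0)            ## price scenarios, nS x nA

u <- c(10, 20, 50, 100)              ## holdings: units of each asset
v <- c(P %*% u)                      ## portfolio value in each scenario

v0 <- sum(p0 * u)                    ## initial portfolio value
w <- p0 * u / v0                     ## weights implied by the holdings
rp <- c(R %*% w)                     ## portfolio return in each scenario

all.equal(v / v0 - 1, rp)            ## both routes give the same returns
```

Each element of v and rp belongs to one scenario (one row of R); neither vector is a time series.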

For selection criteria such as drawdown that need a path of portfolio wealth, we need to work
with time series. For simplicity, assume again that we work with historical data, and arrange the
prices in a matrix $P^{\mathrm{ts}}$ of size $(T + 1) \times n_A$, where each column holds the historical prices of one
asset. The portfolio values over time are still computed by $v = P^{\mathrm{ts}}u$. But now, $v$ is no longer a
sample of portfolio returns across the scenarios, but a time series of portfolio wealth.

To stress the difference between $Pu$ and $P^{\mathrm{ts}}u$: $Pu$ gives a sample of portfolio values over the
cross-section of scenarios, while $P^{\mathrm{ts}}u$ gives one path of the portfolio value from time 0 to time $T$.
For both scenarios and paths, we transform the prices or returns of the single assets into one vector $v$
of portfolio values. Given $v$, any objective function can easily be computed. Thus, we can easily
evaluate a given portfolio $u$. So we start with an initial guess $u$ for a portfolio and then change it
iteratively until we like the resulting vector $v$. This is the essence of scenario optimization; all the
rest is detail (Dembo, 1991; Maringer, 2008b; Maringer and Parpas, 2009).

A disadvantage of scenario optimization is that the computation of $Pu$ needs to be performed
in every iteration of an algorithm; so potentially tens or hundreds of thousands of times. Since the
matrix $R$ is of size $n_S \times n_A$, the computing time will increase with the number of assets, and also
with the number of scenarios. The matrix $R$ may be large and so the multiplication is expensive.
Later on we will discuss how to speed up this computation.

Before we go on, a final remark on scenarios. Setting up scenarios is a crucial part of the 
optimization. The simplest and probably most widely used way is to use historical data. We can 
also resample. Resampling is possible for time series, too: we could, for instance, estimate a model 
that captures serial dependencies (such as a member of the GARCH family) and then resample 
from its residuals; or we may use a block bootstrap method. See Section 6.7 for possibilities. In the 
following examples we use the relatively small data set fundData that comes with NMOF, which 
we described on page 367. 


Ingredient 2: building blocks for selection criteria 


Suppose we have a scenario set, then next we need a selection criterion $\Phi$. In essence, anything is
possible (though not everything is a good idea), so we rather discuss building blocks for alternative
reward and risk functions, that is, building blocks for $\Phi$.

We can combine several criteria, for instance, in ratios. Ratios are easy to communicate and
interpret (Stoyanov et al., 2007); when associated with an optimized portfolio, they always have a
certain optimality feature, like "the maximum reward for a unit of risk," or the "minimum risk for
a unit of reward." With ratios, we always need to safeguard our objective functions for cases where
numerator or denominator could switch signs while moving through the search space. Ratios that
use the mean return for reward, for instance, are often not interpretable anymore if mean returns
are negative. Also, we need to think about what happens when either the numerator or denominator
becomes zero.
If we do not like ratios we can use linear combinations:

$$\Phi(u) = \alpha_1 \Phi_1 + \alpha_2 \Phi_2 + \cdots$$

in which the $\Phi_i$ are different criteria and the $\alpha_i$ are weights. Actually, any combination (nonlinear
ones as well) is possible; but it is often more convenient to keep $\Phi$ simple.
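One possible way to safeguard a ratio is sketched below: the objective function returns a large constant whenever the reward is not clearly positive, so that the search is pushed away from regions where the ratio cannot be interpreted. The function names ratio_OF and linear_OF, the fields eps and big, and the choice of downside deviation over mean return are assumptions made for this example; they are not taken from NMOF.

```r
## hypothetical objective: downside risk per unit of reward (to be minimized)
ratio_OF <- function(w, Data) {
    rp <- c(Data$R %*% w)                           ## scenario returns
    reward <- mean(rp)
    risk <- sqrt(mean(pmin(rp - Data$rd, 0)^2))     ## downside deviation
    if (reward <= Data$eps)                         ## ratio not interpretable
        return(Data$big)
    risk / reward
}

## the same two criteria in a linear combination instead
linear_OF <- function(w, Data, a = c(1, 1))
    a[1L] * sqrt(mean(pmin(c(Data$R %*% w) - Data$rd, 0)^2)) -
    a[2L] * mean(c(Data$R %*% w))

set.seed(7)
R <- array(rnorm(400, mean = 0.01, sd = 0.02), dim = c(100, 4))
Data <- list(R = R, rd = 0, eps = 1e-8, big = 1e6)
w <- rep(0.25, 4)

f1 <- ratio_OF(w, Data)                             ## reward is positive
Data_neg <- Data
Data_neg$R <- -R                                    ## reward is negative
f2 <- ratio_OF(w, Data_neg)                         ## returns Data$big
```

The linear combination needs no such guard, but the weights a then have to be chosen, and the result is harder to interpret than a ratio.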


Moments 


Moments are a natural way to condense sample information into a single number. A moment can
be defined as


$$\operatorname{average}\bigl((r^p - \text{threshold})^k\bigr)$$


with threshold and $k$ being scalars. Most often, the expectation is taken for average, but we may as
well use the median. The $k$th moment is the expectation of a random variable raised to a power $k$,
which need not be an integer.

For a sample, computing expectations reduces to summing returns. We have already used the 
mean and variance; higher moments such as skewness and kurtosis are possible as well. Moments 
with k > 1 are usually centered around their mean. Practically, when the mean is very small (e.g., 
for daily returns), it does not matter much whether it is subtracted or not. Not including it may 
even be more appropriate sometimes; the famous example is the stock price that rises 1% per day, 
every day. If we define the stock’s variability as deviation around the average return, then there is 
no variability. But apparently this is not how traders understand volatility. 
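The example of the steadily rising stock is easy to reproduce. In the following sketch, sample_moment is a hypothetical helper that mirrors the generic definition above; the argument names are assumptions.

```r
## hypothetical helper: average of (r - threshold)^k
sample_moment <- function(r, k, threshold = 0, average = mean)
    average((r - threshold)^k)

r <- rep(0.01, 250)                  ## the stock that rises 1% every day

centered   <- sample_moment(r, 2, threshold = mean(r))
uncentered <- sample_moment(r, 2)    ## threshold of zero

sqrt(centered)                       ## essentially zero: no variability
sqrt(uncentered)                     ## about 1% per day
```

Measured around its mean, the series has no variability; measured around zero, it moves by about one percent per day.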


Partial moments 

Partial moments are a convenient way to distinguish between returns above and below a desired
return threshold $r_d$, and to capture potential asymmetry around this threshold. For a sample of $n_S$
return observations, partial moments $\mathrm{PM}_{\gamma}(r_d)$ can be computed as


$$\mathrm{PM}^{+}_{\gamma}(r_d) = \frac{1}{n_S} \sum_{r^p > r_d} (r^p - r_d)^{\gamma}, \tag{14.14a}$$

$$\mathrm{PM}^{-}_{\gamma}(r_d) = \frac{1}{n_S} \sum_{r^p < r_d} (r_d - r^p)^{\gamma}. \tag{14.14b}$$


The superscripts $+$ and $-$ indicate the tail (i.e., upside or downside). Partial moments take two more
parameters, an exponent $\gamma$ and the threshold $r_d$. The expression "$r^p > r_d$" indicates to sum only over
those returns that are greater than $r_d$.

A well-known partial moment is the semi-variance $\mathrm{PM}^{-}_{2}(m^p)$, in which $m^p$ is the mean portfolio
return, or more generally $\mathrm{PM}^{-}_{2}(r_d)$. The
square root of this expression, sometimes called downside deviation, is used as the risk function
in several performance measures; e.g., the Sortino and the Upside Potential ratio (Sortino et al.,
1999). The Sortino ratio, to be maximized, is defined as


$$\frac{m^p - r_d}{\sqrt{\mathrm{PM}^{-}_{2}(r_d)}}$$

for a fixed $r_d$. The Upside Potential ratio, also to be maximized, is defined as

$$\frac{\mathrm{PM}^{+}_{1}(r_d)}{\sqrt{\mathrm{PM}^{-}_{2}(r_d)}}.$$


When we minimize, we turn the ratios upside down.
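A compact way to compute partial moments and the two ratios for a return sample is sketched below. The helper PM and its arguments are assumptions introduced for this example; the four returns are made up.

```r
## hypothetical helper: partial moment of order gamma around rd;
## sums only over returns strictly beyond the threshold, divides by n
PM <- function(r, rd, gamma, lower = TRUE) {
    d <- if (lower) rd - r else r - rd
    sum(d[d > 0]^gamma) / length(r)
}

r  <- c(-0.02, 0.01, 0.03, -0.01)    ## made-up scenario returns
rd <- 0                              ## desired return threshold

semivar  <- PM(r, rd, 2)                       ## lower partial moment, order 2
upside   <- PM(r, rd, 1, lower = FALSE)        ## upper partial moment, order 1
p_below  <- PM(r, rd, 0)                       ## share of returns below rd

sortino  <- (mean(r) - rd) / sqrt(semivar)
upr      <- upside / sqrt(semivar)
```

Selecting d[d > 0] before raising to the power matters for order zero, since R evaluates 0^0 as 1; without the selection, PM(r, rd, 0) would always be 1.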


Conditional moments


Conditional moments can be calculated as


$$\mathrm{CM}^{+}_{\gamma}(r_d) = \frac{1}{\#\{r^p > r_d\}} \sum_{r^p > r_d} (r^p - r_d)^{\gamma}, \tag{14.15a}$$

$$\mathrm{CM}^{-}_{\gamma}(r_d) = \frac{1}{\#\{r^p < r_d\}} \sum_{r^p < r_d} (r_d - r^p)^{\gamma}. \tag{14.15b}$$


Again $+$ and $-$ indicate the tail, and "$\#\{r^p > r_d\}$" stands for the number of returns greater than $r_d$.
We could also recombine such moments: suppose we wish to minimize variability, but we would
like to penalize the upside less than the downside. Then we could use something like the following:


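$$\frac{1}{\#\{r^p > r_d\}} \sum_{r^p > r_d} (r^p - r_d)^{\gamma^{+}} + \frac{1}{\#\{r^p < r_d\}} \sum_{r^p < r_d} (r_d - r^p)^{\gamma^{-}}$$

with separate exponents $\gamma^{+}$ and $\gamma^{-}$ for the upper and the lower tail.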


Conditional and partial moments are closely related. For a threshold $r_d$, the lower partial moment
of order $\gamma$ equals the lower tail's conditional moment of the same order times the lower partial
moment of order zero; the same holds for the upper tail:


$$\mathrm{PM}^{+}_{\gamma}(r_d) = \mathrm{CM}^{+}_{\gamma}(r_d)\,\mathrm{PM}^{+}_{0}(r_d),$$
$$\mathrm{PM}^{-}_{\gamma}(r_d) = \mathrm{CM}^{-}_{\gamma}(r_d)\,\mathrm{PM}^{-}_{0}(r_d).$$


The partial moment of order zero is simply the probability of obtaining a return beyond $r_d$. So, in
words, conditional moments measure the magnitude of returns around $r_d$, while partial moments
also take into account the probability of such returns. For a fixed $r_d$, conditional and partial moments
convey different information because both the probability and the conditional moment need to be
estimated from the data to obtain a partial moment.


> x <- rnorm(20)

> ## mean and probability of loss
> theta <- 0.1  # ... or mean(x)
> prob.loss <- ecdf(x)(theta)
> exponent <- 2

> ## conditional moment (CM)
> (cm <- mean((x[x < theta] - theta)^exponent))
[1] 1.46

> ## partial moment (PM)
> xx <- x - theta
> xx[xx > 0] <- 0
> (pm <- mean(xx^exponent))


FLIp04 54 


> ## relationship between PM and CM 
> all.equal(cm * prob.loss, pm) 


[1] TRUE

FIGURE 14.2 Portfolio return distributions: the influence of $r_d$.


Partial and conditional moments, in our definitions, are centered around $r_d$. If $r_d$ is fixed, we can
directly work with Eqs. (14.14) and (14.15). To give an example, Fig. 14.2 (left panel) pictures the
return distributions of two portfolios. If we aim to find a portfolio that minimizes a function of the
returns below $r_d$, there should be little debate which portfolio to choose.

Indeed, it is intuitive to fix $r_d$ at a specific level such as zero. Alternatively, we could use
quantiles to determine $r_d$, for instance, the loss (negative return) that is only exceeded with a given
probability. This convention is often found with conditional moments. In that case, we cannot use
Eq. (14.15), since fixing a quantile says nothing about the location of the return distribution that the
algorithm selects. This is illustrated in Fig. 14.2 (right panel): in that picture we would
certainly prefer the more variable distribution, but a decision rule based on a moment centered around
the quantile would choose the dominated one on the left.

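The simplest remedy is not to center around $r_d$, but to set


$$\mathrm{CM}^{+}_{\gamma}(r_d) = \frac{1}{\#\{r^p > r_d\}} \sum_{r^p > r_d} (r^p)^{\gamma}, \tag{14.16a}$$

$$\mathrm{CM}^{-}_{\gamma}(r_d) = \frac{1}{\#\{r^p < r_d\}} \sum_{r^p < r_d} (r^p)^{\gamma}. \tag{14.16b}$$


Without centering, we have no more guarantee for the sign of $r^p$. Hence, for $\gamma \neq 1$, we replace $r^p$
with $\max(r^p, 0)$ in Eq. (14.16a), and by $\min(r^p, 0)$ in Eq. (14.16b).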


Quantiles 


A quantile of a sample $r^p = [r^p_1 \; r^p_2 \; \ldots \; r^p_{n_S}]'$ is defined as

$$Q_q = \mathrm{CDF}^{-1}(q) = \min\{r \mid \mathrm{CDF}(r) \geq q\},$$


where CDF is the cumulative distribution function and $q$ may range from 0% to 100% (we leave out
the percent sign in subscripts). Thus, the $q$th quantile is a number $Q_q$ such that $q$ of the observations
are smaller, and $(100\% - q)$ larger than $Q_q$. For a given sample, we will usually have several
numbers that satisfy this definition (Hyndman and Fan, 1996). The simplest approach is to use
the order statistics of the portfolio returns $[r^p_{(1)} \; r^p_{(2)} \; \ldots \; r^p_{(n_S)}]'$, that is, a step function. If $\ell$ is the
smallest integer not smaller than $q \times n_S$, then the $q$th quantile is the $\max(\ell, 1)$th order statistic.
This is consistent with the convention in many implementations that $Q_0$ is the minimum of the
sample (an estimator for the worst-case return), and $Q_{100}$ is the maximum. Importantly, when we
are interested in the tails of the distribution, we can only work with order statistics when we have
a large enough sample. Suppose, for instance, we had only 20 observations: estimating $Q_5$ via an
order statistic would not make much sense. In fact, using a parametric approximation like setting
the quantile equal to some multiple of volatility would be much preferred in such a case.
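The order-statistic rule translates directly into R. The helper quantile_os below is a hypothetical name used for this sketch; q is given as a fraction between 0 and 1, and the five returns are made up.

```r
## hypothetical helper: q-th quantile as the max(l, 1)-th order statistic,
## where l is the smallest integer not smaller than q * n
quantile_os <- function(r, q) {
    ns <- length(r)
    l <- max(ceiling(q * ns), 1)
    sort(r)[l]
}

r <- c(0.03, -0.02, 0.01, -0.05, 0.00)

q0   <- quantile_os(r, 0)      ## the minimum of the sample
q50  <- quantile_os(r, 0.5)    ## third of five order statistics
q100 <- quantile_os(r, 1)      ## the maximum of the sample
```

For this sample the result should agree with quantile(r, q, type = 1) in base R, which also inverts the empirical distribution function; for other inputs the two may differ in how they treat floating-point rounding of q * n.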

The most famous quantile in finance is Value-at-Risk (VaR). VaR is the loss only to be exceeded 
with a given, usually small, probability at the end of a defined horizon. Thus, VaR is a quantile of 
the return distribution. 

Quantiles can also be used as reward measures; we could maximize a higher quantile (e.g., the
90th). We need to be careful when we construct ratios with quantiles, since ideally we would want
to maximize all quantiles, i.e., move the return distribution to the right. One way to build a ratio is to
consider deviations from a quantile in the middle of the distribution (the median being the natural
candidate). If we use quantiles far in the tails, we can form ratios of the form $-Q_{10}/Q_{90}$ and assume
(and check numerically) that $Q_{10}$ is below zero, and $Q_{90}$ is above.


Algorithm 55 Maximum drawdown of a time series.

1: set high = $P_1$
2: set maxdown = 0
3: for $t$ = 2 : length($P$) do
4:     if $P_t$ > high then
5:         high = $P_t$
6:     else
7:         compute underwater = $1 - P_t/\text{high}$
8:         if underwater > maxdown then
9:             maxdown = underwater
10:        end if
11:    end if
12: end for
13: return maxdown
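Algorithm 55 can be written as a short loop. The sketch below uses the hypothetical function name max_drawdown and compares the loop with a vectorized computation based on cummax from base R; the simulated price series is an assumption for illustration.

```r
## hypothetical helper: a direct transcription of Algorithm 55
max_drawdown <- function(P) {
    high <- P[1L]
    maxdown <- 0
    for (t in seq_along(P)[-1L]) {
        if (P[t] > high) {
            high <- P[t]                      ## new running maximum
        } else {
            underwater <- 1 - P[t] / high     ## relative distance to the high
            if (underwater > maxdown)
                maxdown <- underwater
        }
    }
    maxdown
}

set.seed(99)
P <- c(1, cumprod(1 + rnorm(100, mean = 0.01, sd = 0.03)))

md_loop <- max_drawdown(P)
md_vec  <- max(1 - P / cummax(P))             ## vectorized alternative
```

Both computations divide each price by the running maximum, so they should give the same number; in R the vectorized form is usually the more convenient one.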


Drawdown 


The functions described so far are applied to the distribution of final wealth. We may as well
observe the evolution of a portfolio over time. A useful function of the portfolio's time path is the
drawdown. Let $v$ be a time series of portfolio values, with observations at $t = 0, 1, 2, \ldots, T$. Then
the drawdown DD of this series at time $t$ is defined as


$$\mathrm{DD}_t = v_t^{\max} - v_t \tag{14.17}$$


where $v_t^{\max}$ is the running maximum, that is, $v_t^{\max} = \max\{v_{t'} \mid t' \in [0, t]\}$.


DD is a whole vector of length $T + 1$. A subscript indicates a scalar value, for instance, the
drawdown's mean, its maximum or its standard deviation, or the drawdown at a particular point
in time. Other functions may be computed to capture the information in the drawdown vector, for
example, the mean time underwater (i.e., the average time elapsed between two consecutive values
in DD that are sufficiently close to zero), or the correlation between a portfolio's drawdown and
the drawdown of an alternative asset, like an index. The definition in Eq. (14.17) gives DD in
currency terms. A percentage drawdown is often preferred, obtained by using the logarithm of $v$,
or by dividing Eq. (14.17) by $v_t^{\max}$.

In R, the drawdown can be computed with the function cummax. We demonstrate this for an
artificial time series:


> v <- rnorm(100, mean = 0.01, sd = 0.03)
> v <- c(1, cumprod(1 + v))


Then we only need to compute:


> absD <- cummax(v) - v


absD is the absolute DD of the series. To compute the maximum drawdown, for instance, just find
the maximum of this vector.

We often want the relative drawdown, or drawdown in percentage points. There are several
equivalent possibilities to compute it. We could take the logarithm of the time series.


> logv <- log(v)
> d <- logv - cummax(logv)
> relD1 <- 1 - exp(d)


The last line converts the continuous returns back into discrete ones. A much simpler (and faster)
way is to compute


> cv <- cummax(v)
> relD2 <- (cv - v) / cv


This is also what function drawdown in the NMOF package does. relD2, up to numerical
precision, equals relD1.


> all.equal(relD1, relD2)
[1] TRUE
> do.call(par, par.portfolio)
> par(mfrow = c(2, 1))
> plot(v, type = "l", ...


Older MATLAB releases do not have a cummax function; there, we need a loop. We update the
current maximum of v when going through the observations.


v = randn(1, 100)*0.03 + 0.01;
v = cumprod([1 1+v]);
plot(v)

tmax = v(1);
n0 = length(v);
D = NaN(n0, 1);
for i = 2:n0
    tmax = max(tmax, v(i));
    D(i) = (tmax - v(i))/tmax;  % or D(i) = tmax - v(i)
end


It would be much slower (by orders of magnitude) if we used


tmax = max(v(1:i));


within the loop; that is, if we did not update the maximum.


Ingredient 3: Threshold Accepting


Threshold Accepting builds on Local Search. The algorithm has already been described in
Chapter 12; we repeat the pseudocode in Algorithm 56. We use the symbol $x$ generically for a solution.


Algorithm 56 Threshold Accepting.

1: set threshold sequence $\tau$
2: set $n_{\text{thresholds}}$   ## length of $\tau$
3: set $n_{\text{steps}}$        ## steps per threshold
4: randomly generate current solution $x^c$
5: set $x^* = x^c$
6: for $r = 1 : n_{\text{thresholds}}$ do
7:     for $i = 1 : n_{\text{steps}}$ do
8:         generate $x^n \in \mathcal{N}(x^c)$ and compute $\Delta = f(x^n) - f(x^c)$
9:         if $\Delta < \tau_r$ then $x^c = x^n$
10:        if $f(x^c) < f(x^*)$ then $x^* = x^c$
11:    end for
12: end for
13: return $x^*$

Compared with Local Search, there is only a small difference: we do not require new solutions 
to be better than the current solution. They can be worse, but only up to a threshold. For intuition, 
suppose the thresholds were very large; then, our optimization algorithm would become a random 
walk and any new portfolio would be accepted. On the other hand, if the thresholds were too small, 
the algorithm would be too restrictive, and become stuck in local minima; for zero thresholds, we 
get exactly a Local Search. So the thresholds need to be connected to the step size for the algorithm, 
which is defined by the neighborhood. 

Algorithm 56 uses two nested loops. The outer loop controls the size of the threshold, and the
inner loop runs a number of steps for each threshold. (We could also use only one loop and change $\tau$
as a function of the iteration.) We use the word step when we explicitly refer to one step of the inner
loop; we use iteration for the product $n_{\text{thresholds}} \times n_{\text{steps}}$. (Note that for Local Search, step and iteration
would be the same.)
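To make the two loops concrete, here is a minimal sketch of Algorithm 56 applied to a toy one-dimensional problem. The function threshold_accepting is a hypothetical helper written for this illustration; it is not the implementation of TAopt in NMOF, and the threshold sequence is simply set by hand.

```r
## hypothetical helper, not NMOF's TAopt
threshold_accepting <- function(f, x0, neighbour, tau, n_steps) {
    xc <- x0;    fc <- f(xc)          ## current solution
    xbest <- xc; fbest <- fc          ## best solution so far
    for (r in seq_along(tau)) {       ## outer loop: thresholds
        for (i in seq_len(n_steps)) { ## inner loop: steps per threshold
            xn <- neighbour(xc)
            fn <- f(xn)
            if (fn - fc < tau[r]) {   ## not worse by more than tau[r]
                xc <- xn
                fc <- fn
            }
            if (fc < fbest) {
                xbest <- xc
                fbest <- fc
            }
        }
    }
    list(xbest = xbest, OFvalue = fbest)
}

## toy problem: a bumpy function with several local minima (made up)
f <- function(x) x^2 + 2 * sin(5 * x)
neighbour <- function(x) x + runif(1, min = -0.2, max = 0.2)

set.seed(2024)
tau <- seq(0.5, 0, length.out = 10)   ## decreasing thresholds, ending at zero
sol <- threshold_accepting(f, x0 = 3, neighbour, tau, n_steps = 500L)
```

The thresholds here are on the same scale as the typical change in f caused by one step of the neighbourhood function, which is the connection between thresholds and step size described above.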

An implementation of TA, called TAopt, is contained in the NMOF package. Like LSopt
before, the function is called with three arguments:


TAopt(OF, algo = list(), ...)


Again, we collect all information other than algo in a list Data. In our experience, directly
passing all arguments in one list makes for a clearer structure, and is less error prone.


OF is the objective function.
algo is a list that holds the settings of the TA algorithm (discussed below).
Data is a list that holds the pieces of data necessary to evaluate the objective function.


TAopt runs for a fixed number of iterations, then it returns a list with several components, among 
them: 


xbest is the solution. 

OFvalue is the objective function value associated with the solution. 

Fmat is a matrix with two columns. The first column contains the solution quality of the new
solution $x^n$ in any iteration. The second contains the objective function value of the current
(i.e., accepted) solution. Unlike in LSopt, this does not necessarily coincide with the best
solution overall. The value of the best solution so far can be found by computing cummin(Fmat[, 2]).

vT are the thresholds (discussed below). 


Like with LSopt, we operate on solutions through functions that we define beforehand. Hence
x need not just be a vector, but can be a list or another data structure. For instance, with many
available assets and a small portfolio, it may be more advantageous to only store the weights of
the included assets. To use TA for a specific problem, we need to decide about several settings and
parameters, quite similar to Local Search before.

Initial solution Provide a random guess. We could even have TAopt draw a random solution.
Unfortunately, getting a feasible solution may not be trivial for a constrained problem. If we
use penalties, this is less problematic (as long as there exist feasible solutions) since the
algorithm will be enticed to find a way back to a feasible solution. In any case, if TA (or any
other heuristic) works correctly, finding a good solution should not depend on "good starting
values" (Gilli and Schumann, 2010c).

Neighborhood The key ingredient of Local Search and also of TA. The neighborhood function 
defines how we move from one solution to the next. In principle, it is simple because for 
portfolio selection problems we have a natural way to create neighbor solutions: we pick one 
asset in the portfolio randomly, “sell” a small quantity of this asset, and “invest” the amount 
received into another asset. If short positions are allowed, the chosen asset to be sold does not 
have to be in the portfolio. The “small quantity” could be a random variate (e.g., a uniform 
over 0% to 0.5%), or a small fixed fraction (e.g., 0.2%). In our experience, using a fixed 
fraction is often somewhat more efficient, even though, for practical purposes, both methods 
give similar results. The neighborhood gives us a lot of flexibility (unlike, for instance, in 
methods like Differential Evolution, where the rules for how to change solutions essentially 
define the method). But it also requires some time to come up with good neighborhoods. 

Thresholds The thresholds are another important element of TA. In principle, it is clear what
we need: the threshold sequence is an ordered vector of positive numbers that decrease to
zero or at least become very small. (In fact, thresholds need not be monotonic, but can
increase as well.) We need to set the number of thresholds and their levels. Thresholds
are closely connected to the neighborhood definition. Larger neighborhoods, which imply
larger changes from one candidate portfolio to the next, need to be accompanied
by larger initial threshold values, and the other way around. Fortunately, once we have settled
on a neighborhood, there exist procedures to compute the thresholds from the neighborhood.
Below we will describe such algorithms; thus, we only need to decide how many thresholds
we want.

Iterations The number of iterations (number of thresholds × number of steps per threshold).

Stopping criterion We use a simple stopping criterion: a fixed number of function evaluations,
given by the product number of thresholds × number of steps per threshold. We can often
reduce computing time by adding a break if, for instance, there was no change in the best
solution over a number of iterations. TAopt provides an optional argument OF.target: if
specified, the algorithm stops once a solution is found that has the desired objective function
value (or is even better).

Constraints TA is very flexible when it comes to enforcing constraints. All the strategies discussed 
in Section 12.5 can be applied: we can repair solutions, transform them, or penalize them. By 
choosing an appropriate neighborhood function, we can also create solutions such that they 
conform with the constraints. 


So to get started we need an objective function, a neighborhood function, possibly some
functions for constraint handling, and thresholds. We only discuss the last ingredient here,
the thresholds, because the method of computing them will stay the same for all examples. The
objective function and the other ingredients are discussed in the examples below.

Winker and Fang (1997) suggested the following method to obtain thresholds: generate a large 
number of random solutions, and select a neighbor for every solution. All these solutions are then 
evaluated according to the objective function, so for every pair (random solution|neighbor solu- 
tion), we obtain a difference in the objective function value. The thresholds are then a number of 
decreasing quantiles of these changes. The procedure is summarized in Algorithm 57. 

The number of thresholds $n_{\text{rounds}}$ with this approach is usually large; hence, the number of
steps $n_{\text{steps}}$ per threshold (in the inner loop of Algorithm 56) can be low; often it is only one step per
threshold.

A variation of this method, described in Gilli et al. (2006) and summarized in Algorithm 58, is
to take a random walk through the data with steps made according to the neighborhood definition.⁸


8. This procedure has also been used in MATLAB’s Genetic Algorithm and Direct Search toolbox. 
Algorithm 57 Computing the threshold sequence (Variant 1).
1: set $n_{\text{rounds}}$ (# of thresholds), $n_{\text{deltas}}$ (# of random solutions)
2: for $i = 1$ to $n_{\text{deltas}}$ do
3:   randomly generate feasible current solution $x^{c}$
4:   generate $x^{n} \in \mathcal{N}(x^{c})$
5:   compute $\Delta_i = |\Phi(x^{n}) - \Phi(x^{c})|$
6: end for
7: sort $\Delta_1 \le \Delta_2 \le \cdots \le \Delta_{n_{\text{deltas}}}$
8: set $\tau = \Delta_{n_{\text{rounds}}}, \ldots, \Delta_1$


At every iteration, the changes in the objective function value are recorded. The thresholds are then 
a number of decreasing quantiles of these changes. We will use this second variant. It is also the 
method implemented in TAopt. 


Algorithm 58 Computing the threshold sequence (Variant 2).
1: set $n_{\text{rounds}}$ (# of thresholds), $n_{\text{deltas}}$ (# of random steps)
2: randomly generate feasible current solution $x^{c}$
3: for $i = 1 : n_{\text{deltas}}$ do
4:   generate $x^{n} \in \mathcal{N}(x^{c})$ and compute $\Delta_i = |\Phi(x^{n}) - \Phi(x^{c})|$
5:   $x^{c} = x^{n}$
6: end for
7: compute empirical distribution CDF of $\Delta_i$, $i = 1, \ldots, n_{\text{deltas}}$
8: compute threshold sequence $\tau_k = \text{CDF}^{-1}\!\left(\frac{n_{\text{rounds}} - k}{n_{\text{rounds}}}\right)$, $k = 1, \ldots, n_{\text{rounds}}$


Many variations are possible. For example, Algorithm 58 uses equidistant quantiles, so for
$n_{\text{rounds}} = 5$, the 80th, 60th, 40th, 20th, and zeroth quantile are used. The convention in MATLAB
or R is to set the zeroth quantile equal to the minimum of the sample; hence, the last threshold is
not necessarily zero. There is some evidence that the efficiency of the algorithm can be increased
by starting with a lower quantile (e.g., the 50th), but TA is robust to different settings of these
parameters. Algorithm 58 is advantageous if feasible solutions are difficult to find because we need
just one feasible initial solution.⁹ For this second variant, Gilli and Schumann (2010b) study the
quality of solutions obtained in portfolio optimization problems for different numbers of thresholds
for a fixed $n_{\text{rounds}} \times n_{\text{steps}}$. They find that the performance of the algorithm deteriorates for very small
numbers of thresholds (e.g., 2 or 3), but stays roughly the same for more than 10 thresholds.
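The random-walk procedure of Algorithm 58 can be sketched in a few lines of R. This is only an illustration under stated assumptions: the function name thresholdSequence, its arguments, and the toy objective and neighborhood functions are made up for this sketch, and TAopt's own implementation may differ in details such as the quantile definition.

```r
## Sketch of Algorithm 58: thresholds from a random walk.
## All names are illustrative; this is not the NMOF implementation.
thresholdSequence <- function(OF, neighbour, x0,
                              nDeltas = 1000L, nRounds = 5L) {
    deltas <- numeric(nDeltas)
    xc <- x0
    Fc <- OF(xc)
    for (i in seq_len(nDeltas)) {
        xn <- neighbour(xc)            # one step of the random walk
        Fn <- OF(xn)
        deltas[i] <- abs(Fn - Fc)      # record absolute change
        xc <- xn                       # always move on
        Fc <- Fn
    }
    ## equidistant quantiles (nRounds - k)/nRounds, k = 1, ..., nRounds
    probs <- (nRounds - seq_len(nRounds)) / nRounds
    unname(quantile(deltas, probs, type = 1))   # type 1: inverse of empirical CDF
}

## toy problem (assumed for illustration): minimize a sum of squares
set.seed(42)
OFtoy <- function(x) sum(x^2)
nbToy <- function(x) {
    i <- sample.int(length(x), 1L)
    x[i] <- x[i] + runif(1, -0.1, 0.1)
    x
}
vT <- thresholdSequence(OFtoy, nbToy, x0 = rnorm(10))
vT   # decreasing; the last element is the smallest recorded change
```

Because the zeroth quantile is the sample minimum, the last threshold in vT is small but need not be exactly zero.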


14.3.3 Portfolio optimization with TA: examples 


In this section we discuss several simple examples. These examples are not meant as “ready to go”
applications, but they may give guidance on how to approach given problems. That being said, the
code may often require only minor changes to be reused for a given problem.


A benchmark problem: minimizing squared returns 


It is advisable to run a freshly-implemented algorithm on problems that can be solved by other, 
reliable techniques. This helps to find errors in the implementation, and it builds intuition about the 
magnitude of randomness of the solutions for given computational resources. The prime candidate 
for such a benchmark is a mean-variance problem with only few restrictions: a budget constraint, 
long positions only, and maximum holding sizes. For such a problem, an exact solution can be 
found with quadratic programming (QP). We use solve.QP from the quadprog package.


Aim: to find a long-only portfolio w that minimizes squared returns across all scenarios. 


9. If we use penalties, and scale the penalty function with the current objective function values, we need be cautious since 
the random walk can make the objective function explode. 
We want to solve the following model:

$$\min_{w} \Phi \qquad (14.18)$$

$$w'\iota = 1,$$
$$0 \le w_j \le w_j^{\max} \quad \text{for } j = 1, 2, \ldots, n_A.$$


We set $w_j^{\max}$ to 5% for all assets. $\Phi$ is the squared return of the portfolio, $w'R'Rw$, which is similar
to the portfolio return's variance. We have

$$\frac{1}{n_S} R'R = \operatorname{Cov}(R) + mm'$$

with Cov being the variance-covariance matrix operator that maps the columns of $R$ into their
variance-covariance matrix; $m$ is a column vector that holds the column means of $R$, that is,
$m' = \frac{1}{n_S}\iota'R$ (see also Section 14.2.5). For short time horizons, the mean of a column is small compared
with the average squared return of the column. Hence we ignore the matrix $mm'$, and variance and
squared returns become equivalent.
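The decomposition can be checked numerically. The following sketch uses simulated returns (an assumption; the numbers are not taken from fundData) and rescales R's cov, which divides by n_S − 1, so that it matches the divisor n_S used in the identity.

```r
## Numerical check of (1/ns) R'R = Cov(R) + mm' on simulated data
set.seed(1)
ns <- 500L; na <- 4L
R <- matrix(rnorm(ns * na, mean = 0.001, sd = 0.02), nrow = ns, ncol = na)

m    <- colMeans(R)
covR <- cov(R) * (ns - 1) / ns      # cov() divides by ns - 1; rescale to ns
lhs  <- crossprod(R) / ns           # (1/ns) R'R
rhs  <- covR + tcrossprod(m)        # Cov(R) + mm'

all.equal(lhs, rhs)
## relative size of the neglected term mm'
max(abs(tcrossprod(m))) / max(abs(lhs))
```

For returns with a small mean relative to their volatility, the last ratio is small, which is the reason why the text drops mm'.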

The objective function will be proportional to the inner product $w'R'Rw$, so we can save much
time by computing $R'R$ beforehand. But, in general, this is not possible with other objective
functions; we rather need to compute $Rw$ in every iteration to evaluate $\Phi$. As the scenario set we use
fundData, which consists of 500 return scenarios for 200 funds.


Example 14.2 


Computing sums of squares.
In many objective functions, a final step is to compute a sum of squares, that is, for some vector $e$
of length $n$ we need to compute

$$\sum_{i=1}^{n} e_i^2 .$$

With MATLAB and R, matrix computations are often more efficient than directly squaring the elements.
Hence, replacing the explicit sum by $e'e$, we get the same result (up to numerical precision), but generally
the computation is faster for large vectors.


> library("rbenchmark")   ## [e-squared]
> n <- 50000
> e <- rnorm(n)
> benchmark(
      z1 <- sum(e^2),
      z2 <- e %*% e,
      z3 <- sum(e * e),
      z4 <- crossprod(e),
      replications = 1000, order = "relative")[, c(1, 3, 4)]
                test elapsed relative
2      z2 <- e %*% e   0.071     1.00
4 z4 <- crossprod(e)   0.123     1.73
1     z1 <- sum(e^2)   0.173     2.44
3   z3 <- sum(e * e)   0.180     2.54


The matrix dot product is fastest.¹⁰ A similar test script for MATLAB follows.


10. In the first edition, function crossprod was fastest. Evidence that measuring matters more than mindless memorizing.
Listing 14.5: C-PortfolioOptimization/M/./Ch13/sumSquares.m

% sumSquares.m -- version 2010-12-23
trials = 1000; n = 50000; ee = randn(n,1);

tic, for i = 1:trials, z1 = sum(ee .^ 2); end, toc
tic, for i = 1:trials, z2 = ee' * ee; end, toc
tic, for i = 1:trials, z3 = dot(ee,ee); end, toc


Algorithm 59 Neighborhood (budget constraint).
1: set $\epsilon$
2: randomly select asset $i$
3: set $w_i = w_i - \epsilon$
4: randomly select asset $j$
5: set $w_j = w_j + \epsilon$


The basic mechanism of the neighborhood is given in Algorithm 59. In each iteration, the value
of $\epsilon$ is determined by a draw of a random variate that is uniformly distributed over the range
[0, 0.5%]. Practically, $\epsilon$ should not be too small. We work with portfolio weights, but the same
principle could be used with actual position sizes.

The neighborhood automatically enforces the budget constraint, but the holding size constraints
are not observed, not even the long-only constraint. (We may also select the same asset twice,
leading to a no-change step. But such an occurrence is unlikely and leads to negligible inefficiencies.)
The simplest way to enforce the constraints would be through a penalty function.¹¹ For this
relatively simple constraint, it is easier and more efficient to incorporate the constraint directly
into the neighborhood; see Algorithm 60. This algorithm requires that we start with a solution
that is feasible with respect to the holding size constraints (which may not be so simple in more
complicated problems). A general remark: there is always an infinity of possible implementations of the
neighborhood, so we never know whether we found an optimal or close-to-optimal specification. But a
rule in financial optimization (not just for heuristics) is that we can always stop searching if our
solution is good enough. So if we have found a neighborhood that works sufficiently well, we can
stop searching for a better implementation.
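To see what the basic mechanism of Algorithm 59 does and does not guarantee, consider the following sketch. The function name neighbourBudget and the argument epsMax are assumptions made for this illustration; the function is not part of NMOF.

```r
## Sketch of Algorithm 59: move a small amount eps from asset i to asset j.
## Names are illustrative only.
neighbourBudget <- function(w, epsMax = 0.005) {
    eps <- runif(1) * epsMax           # uniform over [0, 0.5%]
    i <- sample.int(length(w), 1L)     # asset to sell
    j <- sample.int(length(w), 1L)     # asset to buy (may equal i)
    w[i] <- w[i] - eps
    w[j] <- w[j] + eps
    w
}

set.seed(7)
w <- rep(1/20, 20)                     # start at the upper bound of 5%
for (s in 1:1000) w <- neighbourBudget(w)

sum(w)      # budget constraint still holds (up to rounding)
range(w)    # but weights leave the interval [0, 0.05]
```

The sum of the weights is preserved by construction, whereas nothing keeps single weights within their bounds; this is what Algorithm 60 addresses.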


Algorithm 60 Neighborhood (holding sizes).
1: randomly select $i \in \{\text{assets with weight} > w^{\min}\}$
2: randomly select $j \in \{\text{assets with weight} < w^{\max}\}$
3: set $\epsilon$
4: $w_i = w_i - \epsilon$
5: $w_j = w_j + \epsilon$


So let us go through this example step-by-step. We start by attaching the NMOF package and
preparing the data. All necessary information is collected in the list Data. In particular, we put in
the actual data (the matrix R), but also, to demonstrate the possible speedup, the cross product R'R.
Note that we transpose R because we will compute Rw with crossprod.


11. Minimizing squared returns is essentially minimizing variance. Suppose we combine two assets into a portfolio with
weights $w_1$ and $w_2$, then the portfolio variance is going to be

$$w_1^2 \operatorname{Var}(\text{asset 1}) + w_2^2 \operatorname{Var}(\text{asset 2}) + 2 w_1 w_2 \operatorname{Cov}(\text{asset 1}, \text{asset 2}) .$$

If there is a positive correlation (and, hence, covariance), the algorithm will always like to short one of the assets. This is
not so bad if short sales are allowed, but it is a problem in a long-only portfolio. The average correlation between the funds in the data
set that we use in this section is about 0.5. A penalty may thus not be very efficient, that is, it would require more iterations
than a repair mechanism. The same effect would occur if we wanted to incorporate the budget constraint through a penalty
term: a zero vector has zero variance; hence, it will require some “tinkering” until the constraint is observed.
> library("NMOF")   ## [setup]
> na <- dim(fundData)[2L]
> ns <- dim(fundData)[1L]
> wmin <- 0.0
> wmax <- 0.05
> Data <- list(R = t(fundData),
               RR = crossprod(fundData),
               na = dim(fundData)[2L],
               ns = dim(fundData)[1L],
               eps = 0.5/100,
               wmin = wmin,
               wmax = wmax)


A neighborhood, following Algorithm 60, looks as follows. The logical vectors toSell and
toBuy are transformed into indices with which. The function which is clearer and usually faster
than the equivalent (1:na)[toSell].


> neighbour <- function(w, Data) {   ## [nb]
      eps <- runif(1) * Data$eps
      toSell <- which(w > Data$wmin)
      toBuy <- which(w < Data$wmax)
      i <- toSell[sample.int(length(toSell), size = 1L)]
      j <- toBuy[sample.int(length(toBuy), size = 1L)]
      ## do not sell more than available, do not buy beyond wmax
      eps <- min(w[i] - Data$wmin,
                 Data$wmax - w[j],
                 eps)
      w[i] <- w[i] - eps
      w[j] <- w[j] + eps
      w
  }


If we used sample directly, we would need to test the length of which(toSell) and
which(toBuy). Suppose which(toSell) evaluates to a single integer, then sample will
return a sample from 1:which(toSell). The documentation of sample suggests as an elegant
solution the function resample (which we have inlined in the neighborhood function, but
shall use explicitly later).


> resample <- function(x, ...)   ## [resample]
      x[sample.int(length(x), ...)]
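The pitfall is easy to reproduce. In the following sketch, the value 7 stands for the index of a single eligible asset (an assumption for illustration); resample is defined as in the text.

```r
## sample() versus resample() when only one candidate is left
resample <- function(x, ...) x[sample.int(length(x), ...)]   # as in the text

set.seed(123)
candidates <- 7L      # a single eligible asset with index 7

drawsSample   <- replicate(200, sample(candidates, 1))
drawsResample <- replicate(200, resample(candidates, 1))

range(drawsSample)      # draws come from 1:7, not just from 7
unique(drawsResample)   # always the intended index
```

With sample, the neighborhood would occasionally pick assets that are not eligible at all; resample always returns an element of the vector it is given.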


We define two objective functions: OF1 uses the general scenario approach, so it computes Rw in
every step. OF2 uses the cross product R'R.


> OF1 <- function(w, Data) {   ## [OF-1-2]
      Rw <- crossprod(Data$R, w)
      crossprod(Rw)
  }

> OF2 <- function(w, Data) {
      aux <- crossprod(Data$RR, w)
      crossprod(w, aux)
  }

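Before running the optimization, it may be worth confirming that both objective functions return the same value. The sketch below uses a simulated matrix fundSim in place of fundData (an assumption, so that the example runs without the NMOF data set); the two functions are those from the text.

```r
## Check that OF1 and OF2 agree (simulated stand-in for fundData)
set.seed(99)
ns <- 500L; na <- 200L
fundSim <- matrix(rnorm(ns * na, sd = 0.02), ns, na)

Data <- list(R  = t(fundSim),           # transposed, as in the text
             RR = crossprod(fundSim))   # R'R, computed once

OF1 <- function(w, Data) {
    Rw <- crossprod(Data$R, w)          # equals fundSim %*% w
    crossprod(Rw)
}
OF2 <- function(w, Data) {
    aux <- crossprod(Data$RR, w)
    crossprod(w, aux)
}

w <- runif(na); w <- w / sum(w)
c(OF1(w, Data), OF2(w, Data))
all.equal(c(OF1(w, Data)), c(OF2(w, Data)))
```

OF1 multiplies a matrix of size n_S × n_A with w, whereas OF2 works with the n_A × n_A matrix R'R, so the advantage of OF2 grows with the number of scenarios.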
Now we create a random weight vector w0, collect all settings in algo, and call TAopt. 


> w0 <- runif(na); w0 <- w0/sum(w0)   ## [TA-OF-1-2]

> algo <- list(x0 = w0,
               neighbour = neighbour,
               nS = 2000L,
               nT = 10L,
               nD = 5000L,
               q = 0.20,
               printBar = FALSE)
> system.time(res <- TAopt(OF1, algo, Data) ) 


Threshold Accepting 


Computing thresholds ... OK 
Estimated remaining running time: 4.09 secs 


Running Threshold Accepting 
Initial solution: 0.243 
Finished. 

Best solution overall: 0.00568 
   user  system elapsed
  24.87   42.22    5.63


> c(100 * sqrt(crossprod(fundData %*% res$xbest)/ns))


[1] 0.337 


> system.time(res <- TAopt(OF2, algo, Data)) ## should be faster 


Threshold Accepting 


Computing thresholds ... OK 
Estimated remaining running time: 2.78 secs 


Running Threshold Accepting 
Initial solution: 0.243 
Finished. 

Best solution overall: 0.00566 
user system elapsed 

17.12 30.07 4.07 


> c(100 * sqrt(crossprod(fundData %*% res$xbest)/ns))


[1] 0.337 

Finally, we compare our results with the exact solution.


> library("quadprog")   ## [RR-benchmark]
> covMatrix <- crossprod(fundData)
> A <- rep(1, na)
> a <- 1
> B <- rbind(-diag(na),
              diag(na))
> b <- rbind(array(-Data$wmax, dim = c(na, 1)),
             array( Data$wmin, dim = c(na, 1)))

> system.time({
      result <- solve.QP(Dmat = covMatrix,
                         dvec = rep(0, na),
                         Amat = t(rbind(A, B)),
                         bvec = rbind(a, b),
                         meq = 1)
  })

   user  system elapsed
  0.095   0.114   0.054

> wqp <- result$solution


We scale the returned solution, that is, we average, take the square root, and multiply by 100. 
We wrap the result into c () so that the dimension is dropped. 


> ## compare results   [check-results]
> c(100 * sqrt(crossprod(fundData %*% wqp)/ns))


[1] 0.336


> c(100 * sqrt(crossprod(fundData %*% res$xbest)/ns))


[1] 0.337


Hence, the exact solution, provided by QP, corresponds to a weekly return of 0.34%. TA provides 
us with essentially the same solution, but it needs more time to compute it. The time difference 
between OF1 and OF2 will increase when we increase the number of scenarios. 

We should also check the constraints. 


> min(res$xbest) ## TA   [check-constraints]

[1] 0

> max(res$xbest) ## TA

[1] 0.05

> sum(res$xbest) ## TA

[1] 1

> min(wqp) ## QP

[1] -9.75e-17

> max(wqp) ## QP

[1] 0.05

> sum(wqp) ## QP

[1] 1
FIGURE 14.3 In-sample versus out-of-sample differences. Left: in-sample difference of a solution pair (QP minus TA)
against the associated out-of-sample difference. Right: same, but only for portfolios for which the in-sample difference is
less than one basis point.


How precise should we be? 


When we run the example, we will get an objective function value that is close to that of the exact solution,
but it will not be the same. When we increase the number of iterations, we will get closer and
closer to the optimum as found by QP. When is it enough? To look into this question, we use again
the data set fundData. Recall that the returns were created with a resampling procedure; thus,
they are by construction i.i.d. returns. We randomly select 400 of the 500 scenarios and compute an
optimal portfolio with QP, and also with TA. Then we compare both obtained solutions with respect
to the 400 selected scenarios (“in-sample”), and also with respect to the remaining 100 scenarios
(“out-of-sample”). We repeat this procedure 5000 times, increasing the number of steps each time.
The procedure is summarized here:

1: for $i = 1 : 5000$ do
2:   sample 400 scenarios without replacement
3:   compute optimal portfolio with QP
4:   set $n_{\text{steps}} = i$
5:   compute portfolio with TA, compute in-sample difference between QP and TA
6:   compute out-of-sample difference for QP and TA on remaining 100 scenarios
7: end for


We compute the difference between QP and TA as 
objective function value of QP − objective function value of TA;


hence, a negative number means that QP was better (had a lower objective function; we again scale 
the values into weekly returns). In-sample, differences could at most be zero, since QP computes 
the exact solution. Out-of-sample, positive differences are possible. 

The main objective of this experiment is to see how these differences change when we increase 
the iterations of TA. We use 10 thresholds, so we test 10 to 50,000 iterations. Actually, the procedure 
confounds two effects, the randomness from the resampled scenarios, and the randomness of the 
algorithm. Alternatively, we could have used two loops: resample, then run TA with 1 to 5000 steps. 
But the results are clear enough; see Fig. 14.3. 

On the left of Fig. 14.3, we plot the in-sample difference of a solution pair (QP minus TA) 
against the associated out-of-sample difference. If a difference is negative, QP is better (has a 
smaller risk). Not unexpectedly, there is indeed a positive relationship between the size of the 
in-sample difference and the out-of-sample difference. On the right, we zoom in on the small rect- 
angle. That’s not true, actually: the rectangle would need to be much smaller, so small that it could 
barely be recognized; because we select only those portfolios for which the in-sample difference is 
less than one basis point. With this criterion, of the 5000 portfolios, more than 3000 are within the
right picture, so more than 3000 portfolios have an in-sample disadvantage of less than one basis
point. The mean absolute out-of-sample difference across these more than 3000 solutions is less
than one-tenth of one basis point, that is, less than 0.001%.

FIGURE 14.4 In-sample versus out-of-sample difference depending on the number of iterations.

Practically, we cannot directly measure the in-sample error since we typically do not know the 
global optimum. But we can use the fact that the in-sample error is correlated with computational 
resources; see Fig. 14.4. (We drop the first 1000 solution pairs, that is, those TA runs with only 10 
to 10,000 iterations.) 

On the left of Fig. 14.4, the in-sample difference (in absolute terms) decreases slowly as there 
are more iterations. Clearly, the difference can never become positive. Fortunately, we are rather 
interested in the out-of-sample differences, plotted on the right; these quickly center around zero. 


Updating scenarios 


Like many other heuristics, TA is computationally intensive, with the main part of running time 
spent on evaluating the objective function. In our case, the objective function requires two steps: 
compute Rw (or Pu), then evaluate the resulting vector. We can often reduce computing time by 
carefully analyzing and profiling (and then rewriting) the objective function. 

The matrix R is often large, storing thousands of scenarios for hundreds or thousands of assets. 
Without structure inherent in R, it may seem unlikely that we can speed up this multiplication. 
But structure exists in the way the optimization method proceeds from one solution to the next. 
With our neighborhood function, we will only change the portfolio weights of two assets in one
iteration, so we can update the multiplication as follows. Let $w^{\Delta}$ be the vector of portfolio
changes, then

$$w^{n} = w^{c} + w^{\Delta}$$
$$Rw^{n} = R(w^{c} + w^{\Delta}) = \underbrace{Rw^{c}}_{\text{known}} + Rw^{\Delta} .$$

Let $R_{*}$ denote the submatrix of changed columns (size $n_S \times 2$) and $w^{\Delta}_{*}$ the vector of position
changes (size $2 \times 1$), then we can replace $Rw^{\Delta}$ with $R_{*} w^{\Delta}_{*}$. This makes the matrix multiplication
practically independent of the number of assets. The procedure is summarized in Algorithm 61.


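Algorithm 61 Threshold Accepting: scenario updating.
1: set $n_{\text{rounds}}$, $n_{\text{steps}}$, and the threshold sequence $\tau$
2: randomly generate current solution $w^{c}$
3: compute $r^{c} = Rw^{c}$
4: for $t = 1 : n_{\text{rounds}}$ do
5:   for $s = 1 : n_{\text{steps}}$ do
6:     generate $w^{n} \in \mathcal{N}(w^{c})$, and compute $R_{*} w^{\Delta}_{*}$
7:     $r^{n} = r^{c} + R_{*} w^{\Delta}_{*}$
8:     compute $\Delta = \Phi(w^{n}, r^{n}) - \Phi(w^{c}, r^{c})$
9:     if $\Delta < \tau_t$ then $w^{c} = w^{n}$, $r^{c} = r^{n}$
10:   end for
11: end for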


When we implement this updating, we could specify our neighborhood $\mathcal{N}$ so that it returns not
just the new solution, but also the weights that have changed so that we can set up $R_{*}$ and $w^{\Delta}_{*}$.
We have explicitly written $\Phi(w, r)$ to emphasize the objective function's dependence on both the
portfolio and data. In principle, this approach could also be applied to other methods; but we need 
to make sure that only a few asset weights are changed in every iteration. We can, for instance, no 
longer repair the budget constraint by simple rescaling. We mentioned above that TAopt works 
on solutions through functions, so a solution can be more than just a vector of weights. In fact, 
the product Rw is always associated with a given weight vector, so it need not be computed in the 
objective function, but can be part of the neighborhood. In this way we can implement the updating 
mechanism without changing the function TAopt. 
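The updating step itself can be verified in isolation. The sketch below uses simulated scenarios and an arbitrary move between two assets (the indices 5 and 17 and the step size are assumptions for illustration) and compares the full product with the updated one.

```r
## Updating Rw after changing two weights (simulated scenarios)
set.seed(2024)
ns <- 1000L; na <- 300L
R  <- matrix(rnorm(ns * na, sd = 0.02), ns, na)

wc <- runif(na); wc <- wc / sum(wc)   # current solution
rc <- R %*% wc                        # known from the previous iteration

i <- 5L; j <- 17L; eps <- 0.002       # sell eps of asset i, buy eps of asset j
wn <- wc
wn[i] <- wn[i] - eps
wn[j] <- wn[j] + eps

rnFull   <- R %*% wn                               # (ns x na) times (na x 1)
rnUpdate <- rc + R[, c(i, j)] %*% c(-eps, eps)     # (ns x 2) times (2 x 1)

all.equal(rnFull, rnUpdate)
```

The updated product touches only two columns of R, so its cost does not grow with the number of assets.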

Let us look now at a numerical example. We start with a random solution. Such a solution is 
now a list of weights w and a vector Rw. (Even better might have been to store Rw as an attribute to 
w. However, it would also have made the examples look slightly more complicated.) 


The neighborhood function then looks as follows (U stands for Updating).

> neighbourU <- function(sol, Data) {   ## [neighbourU]
      wn <- sol$w
      toSell <- which(wn > Data$wmin)
      toBuy <- which(wn < Data$wmax)
      i <- toSell[sample.int(length(toSell), size = 1L)]
      j <- toBuy[sample.int(length(toBuy), size = 1L)]
      eps <- runif(1) * Data$eps
      eps <- min(wn[i] - Data$wmin, Data$wmax - wn[j], eps)
      wn[i] <- wn[i] - eps
      wn[j] <- wn[j] + eps
      ## update Rw using only the two changed columns
      Rw <- sol$Rw + Data$R[, c(i, j)] %*% c(-eps, eps)
      list(w = wn, Rw = Rw)
  }


We prepare the data, again. 


> na <- dim(fundData)[2L]
> ns <- dim(fundData)[1L]
> wmin <- 0.0
> wmax <- 0.05
> Data <- list(R = fundData,
               na = na,
               ns = ns,
               eps = 0.5/100,
               wmin = wmin,
               wmax = wmax)

> OF <- function(sol, Data)
      crossprod(sol$Rw)
> w0 <- runif(Data$na)
> w0 <- w0/sum(w0)
> x0 <- list(w = w0, Rw = fundData %*% w0)
> algo <- list(x0 = x0,
               neighbour = neighbourU,
               nS = 2000L,
               nT = 10L,
               nD = 5000L,
               q = 0.20,
               printBar = FALSE)

> system.time(res2 <- TAopt(OF, algo, Data))

Threshold Accepting


Computing thresholds ... OK 
Estimated remaining running time: 0.928 secs 


Running Threshold Accepting 
Initial solution: 0.24 
Finished. 

Best solution overall: 0.00566 
user system elapsed 

1.192 0.257 1.062 


> c(100 * sqrt(crossprod(fundData %*% res2$xbest$w)/ns))


[1] 0.336 


As a final remark, for this kind of problem we could actually speed up convergence. The best 
heuristic algorithm will be the one that resembles a gradient search as closely as possible. For 
instance, if we set all thresholds for TA to zero, the solution quality improves faster. But we will 
not do that. The aim of this problem was to see if our implementations worked reliably, not if 
TA could compete with QP on such a problem—which TA cannot. Heuristic methods deliberately 
employ strategies like randomness, or the acceptance of inferior solutions, which make heuristics 
inefficient for well-behaved problems like here when compared with classical methods. But these 
strategies are necessary to move away from local minima. In fact, changing model (14.18) only 
slightly already renders classic methods infeasible. To such a problem we go next. 


Minimizing a risk-reward ratio 
Aim: to find a long-only portfolio $w$ of between $K_{\min}$ and $K_{\max}$ assets that minimizes a risk-reward
ratio across a scenario set.


The model is the following:

$$\min_{w} \Phi$$

$$w'\iota = 1,$$
$$0 \le w_j \le w_j^{\max} \quad \text{for } j = 1, 2, \ldots, n_A,$$
$$K_{\min} \le K \le K_{\max} .$$

We set $w_j^{\max}$ to 10% for all assets, hence $K_{\min}$ cannot be smaller than 10. The cardinality constraints
are included to enforce a small portfolio (say, between 10 and 30 assets).¹²

12. If they were to (meaningfully) enforce greater cardinality, we would need to include minimum holding sizes greater
than zero for the assets in the portfolios.

FIGURE 14.5 Squared returns (left panel) and ratio of conditional moments (right panel) for different portfolio weights.

As the objective function we choose the ratio of two conditional moments, so

$$\Phi = \frac{\mathrm{CM}^{-}_{\gamma}}{\mathrm{CM}^{+}_{\gamma}} . \qquad (14.19)$$

This ratio has also been called the Generalized Rachev ratio (Biglova et al., 2004). We do not normalize
the conditional moments; out-of-sample comparisons showed that there is little difference
between $\mathrm{CM}_{\gamma}$ and $\mathrm{CM}_{\gamma}^{1/\gamma}$ (Gilli and Schumann, 2011c). Computing this objective function is more
expensive than computing squared returns as we did in the previous example, in particular for
non-integer exponents. There are different ways to compute powers of returns $Rw$ (recall that $r^{p} = Rw$).
If the exponent is 2, we can compute inner products. R's function crossprod is sometimes more
efficient than %*%, but altogether the differences are less pronounced; only for larger vectors do
we get sizable speedups. In MATLAB the differences are greater. The computation $\sum_i (r_i^{p})^2$ should
always be carried out as Rw' * Rw. When such a scalar product is not possible but the exponent is
an integer like in $\sum_i (r_i^{p})^3$, it is faster to compute the element-wise product Rw .* Rw .* Rw,
and then sum. The following code example compares different ways of computation.


Listing 14.6: C-PortfolioOptimization/M/./Ch13/condMoment.m

% condMoment.m -- version 2010-12-31
rp = randn(10000,1); trials = 1000;

%% squared
tic, for i = 1:trials, a = rp' * rp; end, toc
tic, for i = 1:trials, b = sum(rp .^ 2); end, toc
max(abs(a-b))

%% other exponent
e = 4;
tic
for i = 1:trials
    r = rp;
    for j = 2:e, r = r .* rp; end
    a = sum(r);
end
toc
tic, for i = 1:trials, b = sum(rp .^ e); end, toc
max(abs(a-b))
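An R counterpart of the comparison might look as follows. It is a sketch that only checks that the different ways of computing the power sums agree; the variable names are chosen for this illustration, and timings are left out because they depend on the machine and the R version.

```r
## Different ways to compute power sums of a return vector
set.seed(5)
rp <- rnorm(10000)

## exponent 2: inner product versus element-wise power
a2 <- c(crossprod(rp))
b2 <- sum(rp^2)

## integer exponent 4: repeated multiplication versus the power operator
r <- rp
for (k in 2:4) r <- r * rp
a4 <- sum(r)
b4 <- sum(rp^4)

c(all.equal(a2, b2), all.equal(a4, b4))
```

To compare speed, the single expressions can be wrapped into system.time or passed to benchmark from the rbenchmark package, as in Example 14.2.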


Minimizing this risk-reward ratio is more demanding than minimizing squared returns not just
because it is more expensive to evaluate a solution.¹³ Suppose we have to solve a three-asset problem
with only the budget constraint. Fig. 14.5 shows the objective functions of two different choices
for $\Phi$: on the left, we minimize squared returns (minimizing variance would result in a similar picture);
on the right we minimize the ratio described above. The pictures show the objective function
values for different asset weights (the third weight is fixed by the budget constraint). For the
risk-reward ratio, the direct mapping from solutions into the objective function leads to a very rough
surface. The problem gets more complicated because it actually combines two tasks: a combinatorial
one (find the correct assets), and a continuous one (find the correct weights).

The new feature in this example is the explicit cardinality constraint. We can modify our last
neighborhood (Algorithm 60) as described in Algorithm 62, which leads to the function
neighbourUK (UK stands for Updating and Kardinality).


13. If instead we minimized a ratio of partial instead of conditional moments, the computation would be slightly cheaper 
since we would not have to count the positive and negative scenarios. Also, the objective function is smoother with partial 
moments than with conditional moments. 
Algorithm 62 Neighborhood (budget constraint, cardinality constraints).
1: set $\epsilon$
2: if $K_{\min} < K < K_{\max}$ then
3:   randomly select $i \in \{\text{assets with weight} > 0\}$
4:   randomly select $j \in \{\text{assets with weight} < w^{\max}\}$
5: else if $K = K_{\min}$ then
6:   randomly select $i \in \{\text{assets with weight} > \epsilon\}$
7:   randomly select $j \in \{\text{assets with weight} < w^{\max}\}$
8: else
9:   randomly select $i \in \{\text{assets with weight} > 0\}$
10:  randomly select $j \in \{\text{assets with weight} > 0 \text{ and} < w^{\max}\}$
11: end if
12: adjust $\epsilon$
13: $w_i = w_i - \epsilon$
14: $w_j = w_j + \epsilon$


> neighbourUK <- function(sol, Data) {
      wn <- sol$w
      J <- wn > 0
      K <- sum(J)
      eps <- Data$eps * runif(1)
      if (K > Data$Kmin && K < Data$Kmax) {
          toSell <- wn > 0
          toBuy <- wn < Data$wmax
      } else {
          if (K == Data$Kmax) {
              toSell <- wn > 0
              toBuy <- J & (wn < Data$wmax)
          } else { ## at Data$Kmin
              toSell <- wn > eps
              toBuy <- wn < Data$wmax
          }
      }
      i <- resample(which(toSell), 1)
      j <- resample(which(toBuy), 1)
      eps <- min(wn[i], Data$wmax - wn[j], eps)
      wn[i] <- wn[i] - eps
      wn[j] <- wn[j] + eps
      Rw <- sol$Rw + Data$R[, c(i, j)] %*% c(-eps, eps)
      list(w = wn, Rw = Rw)
  }


A straightforward objective function looks as follows. We pass the exponents $\gamma$ with the list
Data. In the example we use $\gamma = 2$ for both gains and losses.


> OFcmR <- function(sol, Data) {   ## [OFcmr]
      Rw <- sol$Rw
      losses <- Rw - abs(Rw)
      gains <- Rw + abs(Rw)
      nL <- sum(losses < 0)
      nG <- sum(gains > 0)
      vG <- sum(gains^Data$eG)
      vL <- sum(abs(losses)^Data$eL)
      (vL/nL) / (vG/nG)
  }
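The expressions Rw - abs(Rw) and Rw + abs(Rw) equal twice the negative and twice the positive returns, respectively, and zero otherwise. The sketch below compares this computation with a direct one on simulated portfolio returns; cmRatio and OFcmRplain are hypothetical helper names introduced only for this check, and OFcmRplain repeats the computation of OFcmR with Rw passed directly.

```r
## Direct ratio of conditional moments (illustrative helper)
cmRatio <- function(Rw, eL = 2, eG = 2) {
    losses <- Rw[Rw < 0]
    gains  <- Rw[Rw > 0]
    mean(abs(losses)^eL) / mean(gains^eG)
}

## Same computation as OFcmR in the text, but with Rw passed directly
OFcmRplain <- function(Rw, eL = 2, eG = 2) {
    losses <- Rw - abs(Rw)     # 2 * Rw for negative returns, else 0
    gains  <- Rw + abs(Rw)     # 2 * Rw for positive returns, else 0
    nL <- sum(losses < 0)
    nG <- sum(gains > 0)
    (sum(abs(losses)^eL) / nL) / (sum(gains^eG) / nG)
}

set.seed(11)
Rw <- rnorm(500, mean = 0.001, sd = 0.02)   # simulated portfolio returns

## equal exponents: both computations agree
c(cmRatio(Rw), OFcmRplain(Rw))

## unequal exponents: they differ by the constant factor 2^(eL - eG)
c(cmRatio(Rw, 3, 2) * 2^(3 - 2), OFcmRplain(Rw, 3, 2))
```

With equal exponents, as in the example, the factors of two cancel. With unequal exponents the two values differ only by a constant, which does not change the ranking of solutions.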
We could include control structures to change the computation depending on the exponents (e.g.,
use crossprod if exponents are 2). The simplest way is to explicitly pass an instruction with
Data that details if the exponent is 2, not 2 but an integer, and so on. A complete example follows.


> ## prepare Data
> na <- dim(fundData)[2L]
> ns <- dim(fundData)[1L]
> Data <- list(R = fundData,
               na = na, ns = ns,
               eps = 0.5/100,
               wmax = 0.1,
               eG = 2, eL = 2,
               Kmax = 50L)
> Data$Kmin <- ceiling(1/Data$wmax)   ## computed from Data$wmax

> ## initial solution
> card0 <- sample(Data$Kmin:Data$Kmax, 1)
> assets <- sample.int(Data$na, card0, replace = FALSE)
> w0 <- numeric(Data$na); w0[assets] <- 1/card0
> sol0 <- list(w = w0, Rw = fundData %*% w0)

> algo <- list(x0 = sol0, neighbour = neighbourUK,
               nS = 1000L, nT = 10L,
               nD = 10000L, q = 0.5,
               printBar = FALSE)
> system.time(res <- TAopt(OFcmR, algo, Data))


Threshold Accepting 


Computing thresholds ... OK 
Estimated remaining running time: 0.415 secs 


Running Threshold Accepting 
Initial solution: 1.44 
Finished. 

Best solution overall: 1.01 
user system elapsed 

1.030 0.335 0.868 


> plot(res$Fmat[, 1], type = "l") ## not shown in text 
> res$OFvalue; sum(res$xbest$w <= 1e-8); sum(res$xbest$w > 1e-8) 

[1] 1.01 
[1] 150 
[1] 50 
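Returning to the idea of switching the computation depending on the exponents, the following is a minimal sketch under stated assumptions. The function name OFcmR2 and the flag Data$exp2 are hypothetical and not part of the original code. The sketch uses pmin and pmax instead of Rw ∓ abs(Rw); with equal exponents for gains and losses the constant factor cancels in the ratio, so the value matches OFcmR in that case.

```r
## Hypothetical variant of OFcmR: Data$exp2 is an assumed flag that
## signals that both exponents equal 2, so crossprod can be used.
OFcmR2 <- function(sol, Data) {
    Rw <- sol$Rw
    losses <- pmin(Rw, 0)
    gains <- pmax(Rw, 0)
    nL <- sum(losses < 0)
    nG <- sum(gains > 0)
    if (isTRUE(Data$exp2)) {
        vL <- drop(crossprod(losses))   # sum of squared losses
        vG <- drop(crossprod(gains))    # sum of squared gains
    } else {
        vL <- sum(abs(losses)^Data$eL)
        vG <- sum(gains^Data$eG)
    }
    (vL/nL) / (vG/nG)
}

## Small check on simulated returns (illustrative numbers only)
set.seed(10)
sol <- list(Rw = rnorm(100, mean = 0.001, sd = 0.02))
a <- OFcmR2(sol, list(exp2 = TRUE))
b <- OFcmR2(sol, list(exp2 = FALSE, eG = 2, eL = 2))
```

Whether the branch pays off in running time depends on the sample size and would need to be measured.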


It is much more difficult to evaluate the quality of the solution since it is difficult to give an 
economic meaning to such a ratio. Some possible benchmarks could be: 


• Choose feasible random portfolios and compare them with the solution obtained with TA. 
Random portfolios should be easy to beat (in-sample); they rather serve as a check of the 
implementation. Here we sample 10 million equal-weight portfolios with feasible cardinalities and 
plot the best 10,000 of these. 

• Compute the objective function for each single asset, sort them, and compare portfolios built 
from the best assets. 

• Solve a similar but simpler problem; for instance, try to find the best equal-weight portfolio of 
fixed cardinality. 

• Solve the model with another heuristic and compare the results. 
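The first benchmark in the list above can be sketched as follows. This is a scaled-down illustration under assumptions: the return matrix R is simulated as a stand-in for fundData, the cardinality limits and the number of draws are arbitrary, and ratio and randomOF are hypothetical helper names.

```r
set.seed(42)
ns <- 250; na <- 40
R <- matrix(rnorm(ns * na, 0.0005, 0.01), ns, na)  # stand-in for fundData
Kmin <- 10; Kmax <- 20                             # assumed limits

## loss-to-gain ratio with exponent 2, as in the example
ratio <- function(Rw) {
    losses <- pmin(Rw, 0)
    gains <- pmax(Rw, 0)
    (sum(losses^2)/sum(losses < 0)) / (sum(gains^2)/sum(gains > 0))
}

## objective function value of one random equal-weight portfolio
randomOF <- function() {
    k <- sample(Kmin:Kmax, 1)
    assets <- sample.int(na, k)
    ratio(rowMeans(R[, assets, drop = FALSE]))
}

of.rand <- replicate(1000, randomOF())
best <- sort(of.rand)[1:100]     # keep the best 10% for plotting
```

A solution from TA that is not clearly better than the values in best would point to a problem in the implementation.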


The simplest diagnostic is to run the algorithm several times and increase the number of iterations. 
The following picture shows the results. We leave the number of thresholds at 10, and restart TA 
100 times with 2000, 5000, and 15,000 steps per threshold, respectively. 

The distribution slowly moves to the left, but so far there is no telling whether we are near the 
global optimum. We could now increase the iterations further but there is also another way. As 
we said before, the optimization problem here can be seen from two perspectives: as a continuous 
problem (find good weights), or as a discrete one (find good assets). Our neighborhood function 
emphasizes weights; there is no explicit mechanism to change the cardinality. 


[Figure: cumulative distributions of objective function values from the restarts, with a reference marked "Best of random"; x-axis: Objective function.] 


There are good reasons for choosing such a neighborhood specification. We would like the 
changes in the objective function, for a specified neighborhood definition, to be similar all over the 
solution space. This is important in Threshold Accepting, for suppose that for a given neighborhood 
the changes can either be very large or very small. It will then become more difficult to set 
up effective threshold sequences: with large thresholds, essentially all changes that lead to small 
changes in the objective function will be accepted; with small thresholds, we cannot escape local 
minima that require large changes in the objective function. You may now (rightfully) argue that 
the small changes in the objective function are by definition less important than the large changes, 
so if we do not like this behavior, we have either misspecified our model (i.e., we need to rewrite 
our objective function), or we should realize where our priorities are. Something similar is the case 
here. 

Increasing the cardinality with the neighborhood as specified above means to pick an asset with 
zero weight, and increase this weight slightly. So an asset can be quickly added to the portfolio, but 
suppose its weight should be 10%, then it will possibly need a large number of iterations to arrive 
there. Conversely, if an asset does not belong in the portfolio, it will have its weight decreased 
until, finally, the weight drops to zero. In that way, the impact on the objective function value of a 
given portfolio will be small when an asset is added or removed, no matter how good or bad this 
particular asset appears. This is fine if we need a precise solution, but it may take time to change 
the asset composition, in particular to get rid of an asset (i.e. decrease cardinality). 

When we pick a small number of assets from a large universe of different assets, the asset 
selection (which assets to put into the portfolio) often becomes more important than computing 
weights (in particular, if the weights cannot become very large, as is the case here). We have defined 
the neighbors of a portfolio x° as those portfolios that differ only by a small fraction of the weights 
from x°. If asset selection is more important, a reasonable alternative definition of a neighborhood 
would be portfolios that differ by no more than a fixed number of selected assets. (We used such a 
neighborhood definition in Section 14.3.1.) We do not need to rewrite our code to test this strategy. 
We only change two details: first, we use fixed weight changes, so in the neighborhood function 
instead of 


> eps <- Data$eps * runif(1) 

we use 

> eps <- Data$eps 

We also use a larger step size, for instance 2%, hence 

> Data$eps <- 2/100 


The following figure shows results with 3000, 10,000, 50,000, 100,000 and 200,000 steps per 
threshold (in increasingly darker gray). 

[Figure: cumulative distributions of objective function values for the small (0.5%) neighborhood and the large (2%) neighborhood, with a reference marked "Best of random"; x-axis: Objective function.] 


We have left the results obtained with the “small” neighborhood in the figure for comparison. With 
the larger, fixed step size we obtain lower objective function values in less time than with the 
weight-based neighborhood definition. 

The example of this section demonstrates that the neighborhood has a large impact on the 
performance of TA. The example also allows us to demonstrate the advantage of TA when compared 
with Local Search. We use the same neighborhood and the same number of iterations, but once we 
include thresholds (TA) and once we use zero thresholds (i.e., Local Search). The following two 
figures show objective function values over the course of the optimization. 

In the left panel, we plot the best solutions found by the algorithms. The light gray lines are the 
results obtained with Local Search, the dark ones come from TA. In the right panel, we plot, for a 
single run of TA, the objective function values of the best solution and also the accepted solution. 
Since TA accepts inferior solutions, the function values do not decline monotonically. Initially, with 
larger thresholds, the algorithm often moves to portfolios with a higher objective function value. 
Over time, as the thresholds become smaller, the objective function descends more smoothly. 
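The difference between the two methods lies only in the acceptance rule. The following toy sketch illustrates it on a one-dimensional function rather than on the portfolio problem; the objective f, the neighborhood nb, the threshold values, and the function run are all assumptions made for illustration and are not the code behind the figures.

```r
f <- function(x) (x/10)^2 + 3 * sin(x)   # toy objective with many local minima
nb <- function(x) x + sample(c(-1, 1), 1) * runif(1)   # small random step

## Generic search: a new solution is accepted if it is not worse than
## the current one by more than the threshold tau.
run <- function(x0, thresholds, steps) {
    xc <- x0
    fc <- f(xc)
    best <- fc
    path <- numeric(0)
    for (tau in thresholds) {
        for (s in seq_len(steps)) {
            xn <- nb(xc)
            fn <- f(xn)
            if (fn - fc <= tau) {
                xc <- xn
                fc <- fn
            }
            best <- min(best, fc)
            path <- c(path, fc)      # accepted objective function values
        }
    }
    list(best = best, path = path)
}

set.seed(7)
res.ls <- run(20, thresholds = rep(0, 5), steps = 200)           # Local Search
res.ta <- run(20, thresholds = c(2, 1, 0.5, 0.1, 0), steps = 200) # TA
```

With zero thresholds the path of accepted values can never increase; with positive thresholds it can, which is what allows TA to leave local minima.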


A simple hybrid: Local Search and QP 


Aim: to find a long-only portfolio w with cardinality limits that minimizes variance. 


We want to solve the following problem: 


$$\min_{w} \; \Phi \qquad (14.20)$$

subject to

$$w'\iota = 1,$$

$$w_j^{\min} \le w_j \le w_j^{\max} \quad \text{for } j \in J,$$

$$K_{\min} \le K \le K_{\max}.$$


We set wmin to 1% for all assets. The set J holds the assets included in the portfolio, and K 
stands for the portfolio’s cardinality, i.e. #{J}. This is a natural candidate for a simple hybrid: 
run a Local Search for the asset selection, and in each iteration of the Local Search compute the 
minimum-variance portfolio for the selected assets. 


> R <- fundData                                [hybrid-Data] 
> S <- cov(R) 
> na <- ncol(R) 
> Data <- list(R = R, 
               S = S, 
               na = na, 
               Kmin = 25, 
               Kmax = 60, 
               wmin = 0.01, 
               wmax = 0.05) 


The objective function takes a selection of assets x (a logical vector, which will have many FALSE 
elements), computes the minimum-variance portfolio for the given assets, and returns the variance 
of that portfolio. We could use the script provided above; but to keep the code shorter we use the 
implementation provided in the function minvar in the NMOF package. That function evaluates to the 
minimum-variance weights for a given variance-covariance matrix, i.e. a vector. That vector has an 
attribute variance, which provides the variance associated with the weights. 


> mv <- function(x, Data) {                    [of-mv] 
      attr(NMOF::minvar(Data$S[x, x], 
                        wmin = Data$wmin, 
                        wmax = Data$wmax), "variance") 
  } 

Note that we have prefixed minvar with its namespace: NMOF::minvar. We discuss below 
why. 
The neighborhood function: 


> N <- function(x, Data) {                     [N] 
      xn <- x 
      k <- sum(xn) 
      i.in <- which(xn) 
      i.out <- which(!xn) 
      if (k == Data$Kmax) { 
          i <- sample(i.in, 1) 
      } else if (k == Data$Kmin) { 
          i <- sample(i.out, 1) 
      } else { 
          i <- sample(Data$na, 1) 
      } 
      xn[i] <- !xn[i] 
      xn 
  } 


Note that the function N requires that Kmin < Kmax. Equality is best handled as a 
special case, but we leave it out here to keep the function simple. Nevertheless, we sketch what such 
a neighborhood could look like in an appendix to this chapter. 
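To make the special case concrete, here is one way a neighborhood for Kmin equal to Kmax could look. This is a hedged sketch and not the appendix code: the name N_swap is hypothetical, and the function simply swaps one included asset for one excluded asset so that the cardinality never changes.

```r
## Hypothetical swap neighbourhood for the case Kmin == Kmax.
## Data is unused and only kept for a signature compatible with N.
N_swap <- function(x, Data) {
    xn <- x
    i.in <- which(xn)
    i.out <- which(!xn)
    ## index via sample.int so that vectors of length one are safe
    xn[i.in[sample.int(length(i.in), 1)]] <- FALSE
    xn[i.out[sample.int(length(i.out), 1)]] <- TRUE
    xn
}

set.seed(3)
x <- logical(10)
x[1:3] <- TRUE
xn <- N_swap(x, list())
```

The sketch assumes that at least one asset is included and at least one is excluded.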

We put the steps for creating an initial solution into a function random_x. 
> random_x <- function(Data) {                 [random-x] 
      ans <- logical(Data$na) 
      k <- if (Data$Kmax > Data$Kmin) 
               sample(Data$Kmin:Data$Kmax, 1) 
           else 
               Data$Kmin 
      ans[sample(Data$na, k)] <- TRUE 
      ans 
  } 

> table(random_x(Data)) 

FALSE  TRUE 
  154    46 

> table(random_x(Data)) 

FALSE  TRUE 
  169    31 


Test runs with LSopt and TAopt. 


> x0 <- random_x(Data)                         [test-run] 
> sol.ls <- LSopt(mv, 
                  list(x0 = x0, 
                       neighbour = N, 
                       nI = 1000, 
                       classify = TRUE, 
                       printBar = FALSE), 
                  Data = Data) 


Local Search. 

Initial solution: 0.000192 
Finished. 

Best solution overall: 1.09e-05 


> sol.ta <- TAopt (mv, 
list(x0 = x0, ## same x0 as for LSopt 
neighbour = N, 
nI = 10000, 
classify = TRUE, 
printBar = FALSE), 
Data = Data) 


Threshold Accepting 


Computing thresholds ... OK 
Estimated remaining running time: 10.2 secs 


Running Threshold Accepting 
Initial solution: 0.000192 
Finished. 

Best solution overall: 1.09e-05 


> all.equal(sol.ls$xbest, sol.ta$xbest) 

[1] TRUE 

The function random_x has, for our purposes, one weakness: it takes an argument. We would 
prefer to have a function that creates a random initial solution without any arguments, because 
both LSopt and TAopt (and SAopt as well, should you happen to try it) will inspect the value of x0 
that we pass. If it turns out to be a function, LSopt et al. will use x0() as a starting value, i.e. they 
will evaluate the function. But because no arguments are allowed in this invocation, we need a 
function without arguments. Fortunately, R naturally supports creating such a function. (Actually, 
it is the other way around: had R not supported such a mechanism, LSopt et al. would have handled 
the initial solution differently.) 


All we actually need is to produce a function for specific parameters, i.e. a function that binds 
values to names such as wmin. Such a function is called a closure. The function random_x_fun 


evaluates to a function without arguments. 


> random_x_fun <- function(Data) {             [random-x-fun] 
      na <- Data$na 
      kmin <- Data$Kmin 
      kmax <- Data$Kmax 
      stopifnot(kmax > kmin) 
      function() { 
          ans <- logical(na) 
          ## if the case 'kmin == kmax' would 
          ## need to be supported, we needed to 
          ## hedge: if kmin equals kmax, sample 
          ## will produce one value from the 
          ## range 1:kmax. See ?sample. 
          k <- sample(kmin:kmax, 1) 
          ans[sample(na, k)] <- TRUE 
          ans 
      } 
  } 


Let us call the function and look at its result. 


> x0_fun <- random_x_fun(Data)                 [x0-fun] 
> x0_fun 

function() { 
        ans <- logical(na) 
        ## if the case 'kmin == kmax' would 
        ## need to be supported, we needed to 
        ## hedge: if kmin equals kmax, sample 
        ## will produce one value from the 
        ## range 1:kmax. See ?sample. 
        k <- sample(kmin:kmax, 1) 
        ans[sample(na, k)] <- TRUE 
        ans 
    } 
<environment: 0x55817132a5f0> 


The values that we passed in are still there. To retrieve them, we have to explicitly ask for them. 


> ls(envir = environment(x0_fun)) 

[1] "Data" "kmax" "kmin" "na" 

> get("kmin", envir = environment(x0_fun)) 

[1] 25 

> get("kmax", envir = environment(x0_fun)) 

[1] 60 

> str(as.list(environment(x0_fun))) 

List of 4 
 $ kmax: num 60 
 $ kmin: num 25 
 $ na  : int 200 
 $ Data:List of 7 
  ..$ R   : num [1:500, 1:200] 0.00401 0.00314 -0.008.. 
  ..$ S   : num [1:200, 1:200] 1.65e-04 7.86e-05 9.77.. 
  ..$ na  : int 200 
  ..$ Kmin: num 25 
  ..$ Kmax: num 60 
  ..$ wmin: num 0.01 
  ..$ wmax: num 0.05 


So we specify the function x0_fun as the initial solution. 


> algo <- list(x0 = x0_fun, 
               neighbour = N, 
               nI = 500, 
               classify = TRUE, 
               printBar = FALSE) 
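One detail of closures is worth a short, hedged illustration: function arguments in R are evaluated lazily, so a function factory should make sure the values are fixed when the closure is created. In random_x_fun this happens through the assignments such as na <- Data$na. The sketch below uses force for the same purpose; make_sampler and its arguments are hypothetical names.

```r
## Hypothetical function factory: returns a function without
## arguments that draws k out of n elements.
make_sampler <- function(n, k) {
    force(n)     # evaluate the arguments now, not at the first call
    force(k)
    function() {
        ans <- logical(n)
        ans[sample.int(n, k)] <- TRUE
        ans
    }
}

k0 <- 3
draw <- make_sampler(n = 10, k = k0)
k0 <- 5                      # later changes do not affect the closure
x <- draw()
env <- environment(draw)     # the values live in the closure's environment
```

Here sum(x) is 3, not 5, because k was bound when make_sampler was called.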


We restart both LSopt and TAopt with 500 iterations, and we instruct restartOpt to do 
these restarts in parallel. This is why we prefixed minvar with its namespace, i.e. we explicitly 
called NMOF::minvar. 

> sol.ls.500 <- restartOpt(LSopt, n = 100, OF = mv, 
                           algo = algo, Data = Data, 
                           method = "snow", cl = 6) 
> sol.ta.500 <- restartOpt(TAopt, n = 100, OF = mv, 
                           algo = algo, Data = Data, 
                           method = "snow", cl = 6) 

We extract the objective function values. 

> of.ls.500 <- sqrt(sapply(sol.ls.500, `[[`, "OFvalue"))    [OF-values] 
> of.ta.500 <- sqrt(sapply(sol.ta.500, `[[`, "OFvalue")) 


We create 500 random solutions. We plot the distributions of the random solutions and of the LS/TA 
runs in Fig. 14.6. We also increase the iterations for both LS and TA; see Fig. 14.7. 


> of.random <- numeric(500)                    [random-solutions] 
> for (i in seq_along(of.random)) 
      of.random[i] <- mv(random_x(Data), Data) 
> of.random <- sqrt(of.random) 

FIGURE 14.6 Distributions of objective function values for random portfolios (light gray) and solutions of TAopt and 
LSopt (darker gray). There is little difference between Local Search and Threshold Accepting. 

FIGURE 14.7 Distributions of objective function values with 500, 1000 and 2000 iterations. The lighter-gray lines belong 
to TAopt, the darker-gray ones to LSopt. There is virtually no difference between the lines for Local Search and Threshold 
Accepting with 2000 iterations; both methods converge quickly on the same portfolio. 


> plot(ecdf(of.random),                        [hybrid-dist] 
       main = "", 
       xlim = c(0, 
                max(of.random, 
                    of.ls.500, 
                    of.ta.500)), 
       col = grey(0.6), pch = NA, verticals = TRUE) 
> lines(ecdf(of.ls.500), 
        col = grey(0.2), pch = NA, verticals = TRUE) 
> lines(ecdf(of.ta.500), 
        col = grey(0.4), pch = NA, verticals = TRUE) 


14.3.4 Diagnostics for techniques based on Local Search 


In this section, we discuss potential problems that may occur when implementing portfolio selection 
models and some “rules of thumb” that we found helpful in spotting trouble. Though we will often 
refer to TA specifically, several of these points are relevant for similar techniques (such as Local 
Search or Simulated Annealing) as well. 

It is difficult to clearly characterize where TA works well, and where it does not. Thus, analyzing 
the performance of the algorithm typically requires experimentation. Generally, TA may run into 
trouble if the objective function is very noisy, or very flat overall. 


Benchmarking the algorithm 


When we implement the algorithm afresh, it is a good idea to start with a well-known problem, 
one which can also be solved with another method. The obvious candidate is a mean-variance 
optimization with a quadratic programming solver, as we did above. This not only helps to 
spot errors in the implementation, but also gives an intuition of how closely TA approximates the 
exact solution. If, for a convex problem like mean-variance, the solutions from TA differ widely 
across different optimization runs, this indicates insufficient computational resources (i.e. too few 
iterations). 


The neighborhood and the thresholds 


It is helpful to gain some intuition into the local structure of the objective function during the 
search. For low-dimensional problems, we can plot the objective function as in Fig. 14.5. With 
more dimensions, we can take random walks through the data (like in Algorithm 58). Thus we 
start with a random portfolio and move through the search space according to our neighborhood 
function, accepting any new portfolio. The changes in the objective function accompanying every 
step should be visually inspected; for instance, with histograms or CDF plots. A large number of 
zero changes indicates a flat surface: in such regions of the search space the algorithm will get no 
guidance from the objective function. Another sign of potential trouble is the clustering of changes, 
that is, if we have a large number of very small changes and a large number of very large changes, 
which may indicate a bad scaling of the problem. 

The observed changes should be of a roughly similar magnitude as the thresholds, which is why 
such random walks are often used to inform the threshold setting, as was described in Algorithms 57 
and 58. If the thresholds are too large compared with average changes, our optimization algorithm 
will become a random search, since any new portfolio will be accepted. If the thresholds are too 
small, the algorithm will be too restrictive, and become stuck in local minima. 
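Such a random walk takes only a few lines. The following is a minimal sketch under assumptions: returns are simulated, the objective function OF is simply the portfolio's standard deviation, and nb is a simplified long-only neighborhood without holding-size limits. It is not the procedure implemented in TAopt.

```r
set.seed(1)
ns <- 200; na <- 20
R <- matrix(rnorm(ns * na, 0, 0.01), ns, na)   # simulated returns (assumption)
OF <- function(w) sd(drop(R %*% w))            # stand-in objective function

pick <- function(x) x[sample.int(length(x), 1)]   # safe for length-one x
nb <- function(w, eps = 0.005) {
    i <- pick(which(w > 0))                    # asset to sell
    j <- pick(setdiff(seq_along(w), i))        # asset to buy
    step <- min(w[i], eps)
    w[i] <- w[i] - step
    w[j] <- w[j] + step
    w
}

w <- rep(1/na, na)
nD <- 2000
delta <- numeric(nD)
for (s in seq_len(nD)) {
    wn <- nb(w)
    delta[s] <- abs(OF(wn) - OF(w))
    w <- wn                  # accept every move: a random walk
}

## decreasing thresholds taken from the lower quantiles of the changes
thresholds <- quantile(delta, probs = seq(0.5, 0, length.out = 10))
share.zero <- mean(delta == 0)   # a large share would suggest a flat surface
```

A histogram of delta, or a plot of its empirical distribution function, then shows whether the changes cluster or are of similar magnitude.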

During the actual optimization run, it is good practice to store the accepted changes in the 
objective function values (i.e. the accepted Δ-values in Algorithm 56). In TAopt, we do this in the 
matrix Fmat. As a rule of thumb, the standard deviation of these accepted changes should be of the 
same order of magnitude as the standard deviation of the changes in the objective function values 
recorded from the random walk (i.e., the Δ_i-values in Algorithms 57 or 58). 

In any case, the path that a solution has taken through the search space should be studied. As a 
measure of performance, we can store the amount of improvement from an initial (random) solution 
to the final solution. 


Testing the neighborhood function 


The neighborhoods discussed above have all been simple. As a start, simple is good. Simple makes 
it easier to find mistakes (or make fewer in the first place), and it also allows us to enhance the model 
later on, for instance, by adding more constraints. We can later always add more features to the 
algorithm, and better judge how they perform. There is a trade-off: more-complex neighborhoods 
may be more error prone and less flexible—but they are often more efficient. Some remarks: 


• Use inequalities or tolerance levels instead of exact equality if comparisons are important in a 
neighborhood function. 

• When testing a neighborhood, start with extreme portfolios (like all weights at upper boundary, 
initial cardinality of portfolio equal to Kmin, and so on). 

• If cardinality constraints are important, then check how cardinality changes over time. 


Neighborhoods should also be tested on specific test problems. For instance, start with one 
particular portfolio and define a feasible target portfolio. Now run TA with the objective to reach 
the target portfolio, i.e. the objective function is the distance to the target portfolio. If the target 
cannot be reached, something is wrong with the neighborhood. More examples are in the NMOF manual 
(Schumann, 2011–2018a). 
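A target-portfolio test of this kind can be sketched as follows. For brevity the sketch uses a plain Local Search (zero thresholds) rather than TA, and a simplified neighborhood nb without cardinality or holding-size limits; nb, distance, and the target are assumptions made for illustration.

```r
pick <- function(x) x[sample.int(length(x), 1)]   # safe for length-one x

## simplified neighbourhood: move a random amount from asset i to asset j
nb <- function(w, eps = 0.05) {
    i <- pick(which(w > 0))
    j <- pick(setdiff(seq_along(w), i))
    step <- min(w[i], eps * runif(1))
    w[i] <- w[i] - step
    w[j] <- w[j] + step
    w
}

target <- rep(0.2, 5)                            # feasible target portfolio
distance <- function(w) sum(abs(w - target))     # objective: distance to target

set.seed(99)
w <- c(1, 0, 0, 0, 0)                            # extreme starting portfolio
d0 <- distance(w)
for (s in seq_len(5000)) {
    wn <- nb(w)
    if (distance(wn) <= distance(w))             # Local Search acceptance
        w <- wn
}
d1 <- distance(w)
```

If d1 stayed close to d0, the neighborhood would be unable to move towards the target, which would point to a mistake in its code.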


Arbitrage opportunities 


A serious empirical problem with scenario optimization is the presence of arbitrage opportunities 
in the scenario set. This is a problem of the setup, not of the optimization, so it is relevant for any 
optimization technique, not just for TA. There exist formal tests to detect arbitrage (described, for 
instance, in Ingersoll, 1987, Chapter 2), but they do not really help to remove it. Furthermore, these 
tests will not find spurious “good deals” in the data. The resulting overfit is particularly pronounced 
if short positions are allowed, because then the algorithm will finance seemingly advantageous positions 
by short-selling less-attractive assets. This becomes most obvious if we consider portfolios that 
also include options. If some stock never drops more than 10% in our scenarios, the algorithm will 
suggest writing a put on this stock at 90% of the current price. In fact, an unconstrained algorithm 
will sell an infinity of options. Such problems are not limited to long-short portfolios; long-only 
portfolios also overfit in such cases, even though the effect is less pronounced (Gilli et al., 2011). 

A practical solution for the equity-only case is to increase the number of observations in relation 
to the number of assets. When working with historical data, we may either use a longer historical 
time horizon, or reduce the number of selectable assets. If we work with resampled scenarios, we 
can simply increase the number of replications. 

Including restrictions like maximum holding sizes is always advisable, even though it just 
reflects the fact that we cannot model asset prices properly. Such constraints need not be limited 
to position sizes: if applicable, we can also include constraints on aggregate Greeks like Delta 
or Gamma, or a minimum required performance in added artificial crash scenarios. 


Degenerate objective functions 


Numerically, ratios have some unfavorable properties when compared with linear combinations. 
If the numerator becomes zero, the ratio becomes zero and the search becomes unguided; if the 
denominator is zero, we get an error (Inf). If the sign of the numerator or denominator changes 
over the course of the search, the ratio often becomes at least hard, if not impossible, to interpret; 
an example is the Sharpe ratio for negative mean excess return.¹⁴ 

So we generally need safeguards. To avoid sign problems, we can use centered quantities. Lower 
partial moments, for example, are computed as r_d − r for returns lower than r_d; thus, the numbers 
will always be nonnegative. An alternative is to use operations like max(·, 0) or min(·, 0) to assure 
the sign of some quantity. 
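Such safeguards might look as follows. This is a sketch under assumptions: safe_ratio, the threshold rd, and the penalty value big are hypothetical, and returning a large penalty is only one of several possible reactions to a zero denominator.

```r
## Hypothetical safeguarded loss-to-gain ratio (to be minimised).
## rd is the return threshold; big is an assumed penalty value.
safe_ratio <- function(Rw, rd = 0, big = 1e10) {
    losses <- pmax(rd - Rw, 0)   # nonnegative by construction
    gains <- pmax(Rw - rd, 0)    # nonnegative by construction
    num <- mean(losses^2)
    den <- mean(gains^2)
    if (den <= 0)
        return(big)              # no gains at all: penalise instead of Inf/NaN
    num/den
}

r1 <- safe_ratio(c(-0.01, -0.02))   # only losses: penalty value
r2 <- safe_ratio(c(0.01, 0.02))     # only gains: ratio is zero
r3 <- safe_ratio(c(-0.01, 0.01))    # symmetric returns: ratio is one
```

A ratio of exactly zero, as for r2, is numerically harmless, but as discussed next it should prompt a look at the data.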

But there is also a valuable aspect in these instabilities, for they are not only of a numerical 
nature. In fact, problems when computing a ratio may indicate problems with the model or the 
data, which would go unnoticed with a linear combination. For instance, if a risk-reward ratio 
turns zero, this means we have found a portfolio with no risk at all—which, unfortunately, is more 
often an indication of a data problem rather than a really excellent portfolio. 


14.4 Portfolios under Value-at-Risk 
14.4.1 Why Value-at-Risk matters 


In previous sections, we have seen that heuristics are a powerful approach in portfolio 
optimization: they can deal with all sorts of objectives and constraints that are not accessible to 
traditional methods. This is a good thing. However, it requires that the criteria are sound and reflect 
all relevant aspects. If, on the other hand, a risk or performance measure has shortcomings, then these 
might get exploited during the optimization process, and the resulting portfolio will be anything but 
optimal from the investor’s point of view. One such criterion is Value-at-Risk. 

Value-at-Risk (VaR), the maximum loss one will not exceed with a probability (1 − α) by the 
end of an investment horizon of length T, is an intuitive risk measure, and is key in regulatory 
frameworks such as Basel II and Basel III. These are reasons enough to consider it for portfolio 
optimization. Section 9.2 introduced the basic ideas behind VaR, including its advantages and 
disadvantages, as well as some aspects of its estimation. Like other risk measures, VaR can be 
used as an explicit constraint in asset selection; for example, when equity requirements restrict the 
exposure of a leveraged portfolio, then an upper limit on the investment’s VaR is implied. VaR 
could also be set as an explicit objective: the investor could be interested in finding the lowest 
possible probability of exceeding a given level of loss, or finding the lowest loss possible for a given 
exceedance probability. 

14. Strictly speaking, the Sharpe ratio is still interpretable: suppose two portfolio managers delivered the same negative 
returns (-1%, say), and one had a higher volatility than the other. Then the high-risk fund did better, in a way: despite higher 
risk, the portfolio manager succeeded in providing the same small loss as the low-risk manager. But this implies a perverse 
effect: once returns are negative, the algorithm will try to maximize risk. Thus, one may rather prefer a linear combination 
of risk and return instead of a ratio, or simply put a safeguard into the objective function: when return turns negative, change 
the selection criterion (e.g. select only based on risk). 

Either way, minimizing the VaR could be seen as similar to any other attempt at minimizing risk, 
akin to the minimum-variance portfolio. This, however, would ignore an important aspect: VaR 
looks at the quantile of a distribution. If returns (or values) are described via a parametric 
distribution, then the quantiles depend on both the location and the dispersion of the potential 
outcomes. To see this, assume the portfolio’s return for the investment period is normally distributed 
with $r_P \sim N(\mu_P, \sigma_P^2)$; then the $\alpha$ quantile is $r_P^{\alpha} = \mu_P + u_\alpha \sigma_P$, where $u_\alpha = \Phi^{-1}(\alpha)$ is the quantile of 
the standard normal distribution with $\Phi(u_\alpha) = \alpha$. In a mean-variance world, in which, as argued in 
Section 14.2, the objective is to find the optimal trade-off between the expected return (to be 
maximized) and risk (to be minimized), a VaR-optimal portfolio will lie on the efficient frontier. 
Eq. (14.2) can also be stated in terms of volatility, 

$$\min_w f(w) = (1 - \lambda)\, s^P - \lambda\, m^P$$

where $m^P = E(r^P)$ and $s^P = \sqrt{\mathrm{Var}(r^P)}$ are the expected portfolio return and its volatility, 
respectively. Note that $\lambda$ still assumes values between 0 and 1, but it affects the objective slightly 
differently than before in terms of sensitivity. Multiplying $f$ with −1 turns it into a maximization 
problem. Dividing $f$ by $\lambda$ (provided $\lambda > 0$) will not change the optimal solution $w^*$, but reveals 
the relationship between mean-variance efficiency and VaR under a normality assumption, since the 
problem would become 

$$\max_w \; m^P - \lambda' s^P$$

where $\lambda' = (1 - \lambda)/\lambda$. From here, $\lambda' = -u_\alpha$ will find the portfolio that has the highest possible 
VaR threshold, $r_P^{\alpha}$, which will be a mean-variance-efficient portfolio. (Remember that $u_\alpha < 0$ 
for any $\alpha < 50\%$, which is true for typical VaR settings. Also, note that maximizing quantiles with 
$\alpha > 0.5$ could violate the assumption of risk aversion.) 
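The equivalence can be checked numerically. The following sketch uses two assets with made-up expected returns, volatilities, and correlation (all numbers are assumptions for illustration) and compares, over a grid of weights, the α quantile under normality with the two mean-volatility objectives from above.

```r
alpha <- 0.05
u <- qnorm(alpha)                  # quantile of the standard normal, negative

## illustrative inputs (assumptions)
mu <- c(0.08, 0.03)
vol <- c(0.25, 0.10)
rho <- 0.3
Sigma <- diag(vol) %*% matrix(c(1, rho, rho, 1), 2) %*% diag(vol)

w1 <- seq(0, 1, by = 0.01)         # weight of the first asset
m <- s <- numeric(length(w1))
for (i in seq_along(w1)) {
    w <- c(w1[i], 1 - w1[i])
    m[i] <- sum(w * mu)                          # expected portfolio return
    s[i] <- sqrt(drop(t(w) %*% Sigma %*% w))     # portfolio volatility
}

q <- m + u * s                     # alpha quantile of the portfolio return
lambda.prime <- -u
obj <- m - lambda.prime * s        # objective of the maximisation problem
lambda <- 1/(1 + lambda.prime)     # from lambda' = (1 - lambda)/lambda
f <- (1 - lambda) * s - lambda * m # objective of the minimisation problem
```

All three criteria select the same weight on the grid, namely w1[which.max(q)].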

Empirical returns are often only approximately normally distributed. Deviations mainly concern 
the tails: leptokurtosis and fat tails imply that extreme returns are more likely than under a normal 
distribution. In addition, there is often negative skewness since large surprises are rather on the 
negative side. All this affects VaR, since it ultimately looks at the tails of the distribution. One 
might therefore expect that it is a good idea to abandon the normality assumption and work with 
empirical distributions instead. In that case, the VaR is an order statistic, and one no longer uses the 
dispersion of the entire distribution for its estimation. 

Following and expanding on Maringer (2005a,b) and Winker and Maringer (2007a), this section 
has a look at distribution assumptions and portfolio optimization under VaR. 


14.4.2 Setting up experiments 


Estimating VaR is less of a problem if the distribution of the asset returns or the value itself follows 
some parametric distribution with known parameters. In the absence of perfect foresight, these 
parameters have to be estimated, which introduces a first source of errors. In addition, the assumed 
distribution might not be appropriate for the investigated asset, which adds another source of errors 
that is typically more systematic.¹⁵ Thirdly, if parametric assumptions are abandoned and order 
statistics are used to estimate the empirical VaR, then important aspects in the overall distribution 
might be ignored. Finally, we might miss that there is some structural break and that the data we 
fit our model on is not representative of the actual investment period. Perfect foresight is highly 
desirable but unachievable in the real world; working with inappropriate data is anything but 
desirable but sometimes hard to avoid and recognizable only with hindsight. Here we rule out these 
situations and focus on the remaining two: we make assumptions about the distribution that might 
or might not be suitable, but validation data follow the same rules as the observed data on which 
the estimations and optimizations are based. 

15. Section 9.2 has a closer look at that particular problem. 

We start with generating the sample of observed prices; for our purposes, we are not taking a 
historical time series as is, but simulating data to ensure that we can have a validation sample that 
comes from the same data generating process (DGP). In particular, we assume three DGPs: 
following standard reasoning, we model prices to be lognormally distributed with $P_t = P_{t-1} e^{r_t}$, where 
$r_t \sim N(\mu = 0.05/250;\ \sigma^2 = 0.25^2/250)$. In addition, we assume a linear correlation between asset 
returns of $\rho = 0.3$. The investment horizon is H = 5 days, and that is also the block length for the 
bootstrapped data. A sample consisting of S = 1000 training observations (representing roughly 4 
years of daily observations) forms the basis for the portfolio optimization, and another S control 
observations from the same DGP are generated to assess the out-of-sample performance. 

In terms of data structures, the returns are stacked into an (H · S) × N matrix r where the 
columns are for the N different assets. The rows are organized such that the H days of the sth of the S 
observations are found at indices (s-1)*H + (1:H). Next, the end-of-day prices are computed. 
Assuming that each asset has a price of 1 at the beginning of the investment horizon, the following 
MATLAB code can be used: 


C = eye(N)*(1-rho)*sigma^2 + ones(N)*(rho*sigma^2); % covariance matrix 
r = randn(H*S,N) * chol(C) + mu;  % the (H*S) x N matrix with daily returns 
P = r; 
for s = 1:S 
    for h = 2:H                   % cumulative returns within each block 
        P((s-1)*H + h,:) = P((s-1)*H + h-1,:) + P((s-1)*H + h,:); 
    end 
end 
P = exp(P);                       % convert cumulated returns into prices 


To find the corresponding price realizations of a portfolio with asset weights w (N × 1), we use 

PF = reshape(P * w, H, S); 

This produces an H × S matrix where each column represents one of the S simulated paths of 
length H. 

If this sample forms the basis for the investor’s decision, the portfolio weights are sought by 
optimizing in the fashion of a historical simulation. For the sake of simplicity, we assume an 
investor with an initial endowment of 1, that assets are arbitrarily divisible, and that the only 
constraints on the weights are non-negativity and that 100% of the endowment needs to be invested 
in these assets. A risk-free asset is not available. The given (training) sample then forms the basis 
for estimating the Value-at-Risk. If normally distributed returns are assumed, then the computation 
involves estimating the mean and volatility of the returns first. 


rPF = log(PF(end,:));   % returns after H days; initial P=1 
rVaR = mean(rPF) + norminv(alpha) * std(rPF); 
PFVaRnorm = exp(rVaR); 


For empirical distributions, that is simply the a quantile of the simulated prices at the end of the 
investment horizon. The critical portfolio value then is 


PFVaRemp = quantile( PF(end,:), alpha); 


Either way, PFVaR represents the threshold that is fallen short of with a probability of α, and an 
investor will want to see this critical threshold as high as possible. Optimizing her portfolio with 
regard to VaR seems therefore a legitimate request. A critical point is the distribution assumption: 
if the normality assumption does not apply, in particular not to the tails of the distribution, then the 
empirical VaR could be a suitable alternative. 
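
As an illustration, both thresholds can be computed in R along the following lines. This is a sketch with made-up data: the vector PF_end stands in for the last row of PF, and its distribution parameters are assumptions.

```r
## Sketch with made-up end-of-horizon portfolio values (initial value 1)
set.seed(42)
PF_end <- exp(rnorm(1000, mean = 0.002, sd = 0.02))
alpha <- 0.05

## threshold under a normality assumption for the log-returns
rPF <- log(PF_end)
PFVaRnorm <- exp(mean(rPF) + qnorm(alpha) * sd(rPF))

## threshold as the empirical alpha quantile of the simulated values
PFVaRemp <- quantile(PF_end, alpha, names = FALSE)

c(normal = PFVaRnorm, empirical = PFVaRemp)

## share of scenarios that fall short of each threshold
mean(PF_end < PFVaRnorm)
mean(PF_end < PFVaRemp)
```

In-sample, the empirical threshold is fallen short of with a frequency of (almost exactly) α by construction; the parametric threshold need not match α in the sample.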

As mentioned in Section 10.4, minimizing the empirical VaR does not come with a convex, but 
a rather rough search space. The following experiments, therefore, use Differential Evolution as the 
optimization method. 


14.4.3 Numerical results 


The problem of overfitting 


VaR looks, by definition, at rare events. This causes problems when the sample is rather small: 
if there are 1000 observations and α = 5%, then the VaR is the threshold between the 50 lowest 
and the remaining 950 upper observations. As we have seen in Section 9.2, that can cause problems 
when considering a single asset where observations are given. In a portfolio optimization, the 
weights can be chosen such that the resulting combinations and observations look as favorable as 
possible, according to the objective function. If the objective function is looking at these rare events 
only or at this very peculiar threshold, then there is an incentive to exploit the properties of the data 
in this combination process. 

FIGURE 14.8 Distribution of end-of-horizon wealth in a VaR setting. 

Fig. 14.8 illustrates this for a portfolio with N = 40 assets, based on S = 1000 available 
observations with H = 5 days each; asset prices follow a geometric Brownian motion with normally 
distributed innovations. The training and validation samples come from the same data generating 
process but are independent of each other. The left column shows the histograms for the 
in-sample (top row) and out-of-sample (center) end-of-period returns for an equally weighted 
portfolio. The weights are predetermined irrespective of particularities in the data. Differences between 
the in-sample and out-of-sample VaR for the given α = 5% should therefore be nothing more than 
sampling noise. Looking at the lower end of the empirical cumulative distribution (graph in the bottom 
row) confirms this. 

The histograms in the center column are for the portfolio with the same sample data but 
optimized for VaR under a normality assumption. In the sample, the asset returns’ estimated means and 
covariances will deviate from their true values, but typically not by a great deal if the estimation 
is based on a sufficiently large sample. Nonetheless, the optimization process will prefer assets 
with higher expected return (which shift the distribution to the right) and low volatility (which make 
large losses less likely). Also, it will prefer combinations of assets with lower correlation because 
that enhances diversification. However, it cannot distinguish whether these favorable statistics are 
genuine or just sampling artefacts, so it tends to fall for artefacts. With empirical data, it is not 
always easy to spot such artefacts. Given the data generating process used here, however, that 
distinction is simple: all assets, by construction, have the same means, volatilities, and correlations, 
so if one asset looks more favorable than another, the deviation really is just an artefact, and it is 
unlikely that the same artefacts persist in the control sample. Consequently, the portfolios are less 
than perfectly diversified, the out-of-sample risk will be higher than anticipated, and the return lower. 
In combination, the out-of-sample VaR at the same shortfall probability α will be higher (i.e., the 
end-of-horizon wealth lower), and the in-sample VaR will be exceeded with a higher probability 
out-of-sample.¹⁶ The reliability of the distribution assumption will also play an important role in 
the prediction of the actual VaR. 

Under the normality assumption, the VaR was based on the mean and standard deviation of 
all observations; when based on the quantile of the empirical distribution, VaR looks only at a 
small fraction of the sample. With a shortfall probability of 5% (that is, the worst 50 out of 1000 
observations) and 40 assets to choose from, the optimization leads to a portfolio that seems to 
have a rather high end-of-horizon threshold and a correspondingly low VaR. A closer look at the 
properties of this portfolio, however, reveals that this outcome is misleading. The right column 
in Fig. 14.8 highlights the problems. The optimization resulted in a portfolio that has unusually 
many observations just above the threshold: by cleverly choosing the weights, observations that 
are very near the threshold have been made to look less problematic. As a result, these either no 
longer count towards the VaR, or they push the quantile up and lower the VaR. This is substantially 
heavier overfitting than in the previous case under the normality assumption: the optimization does 
not just exploit deviations in the aggregate statistics of some assets, it exploits particularities in some 
rare individual observations. These are even less likely to persist in the validation set, and the gap 
between in-sample and out-of-sample VaR widens, as does the gap in the shortfall probability. In 
passing, note that the means and the standard deviations of the in-sample and out-of-sample 
distributions are, in fact, not too different, even though the empirical cumulative distributions at the 
lower tail are. 


A closer look 


To check whether this result holds for just this one case or is systematic, one can run repeated 
experiments. We repeated the experiments for different values of α between 1% and 10%, with 
S ∈ {500, 1000} and N ∈ {5, 10, 20, 40}, and for investment horizons of H ∈ {1, 5}, representing just 
one day or one week. In addition to normally distributed returns, alternatives included returns from a 
t-distribution with ν = 4 degrees of freedom, which produces samples with fat tails; these were 
scaled such that they exhibit the same mean and volatility as their normally distributed counterparts. 
Furthermore, returns were resampled via a block-bootstrap on empirical returns for S&P 500 stocks, 
based on daily adjusted closing prices from Jan 2010 to Nov 2018, as provided on finance.yahoo.com. 
While there were quantitative differences, the qualitative findings persisted: optimizing under the 
empirical VaR can and will backfire. Out-of-sample shortfall frequencies are higher than their 
in-sample counterparts, and the gap grows larger as the investor chooses from more assets and as the 
training sample becomes smaller. 

Table 14.1 shows how wide this gap is. Based on approximately 100 independent simulations 
per case, it presents by how much, on average, the expected shortfall probability α is exceeded in 
the out-of-sample control sample, all for an investment horizon of H = 5 days. The case study above 
was for normally distributed returns, given a sample size of S = 1000 and N = 40 assets. The 
average exceedance when optimized under a normality assumption with α = 5% is 0.74%, i.e., the 
threshold will be fallen short of with a probability of 5.74%. When optimized for the empirical 
in-sample VaR, the gap is 3.84%, implying that instead of the assumed 5%, the actual shortfall 
probability is 8.84%. For α = 1% and other things equal, the gap is 2.34%, leading to an actual 
shortfall probability of 3.34%, i.e., more than three times the required probability. 

For all cases under the empirical VaR, these deviations are positive and significantly different 
from zero, at least at the usual 5% significance level, and typically at substantially stricter levels. 
Under a VaR with a normality assumption, these gaps are also almost always positive, but not that 
wide, and in some cases not significantly different from zero. Ideally, all these gaps should be equal 
to zero; instead, they come from exploiting artefacts and noise in the in-sample data with a 
susceptible risk measure. To test whether these gaps emerge due to a flaw in the objective function, 
one can analyze the situation for non-optimized, equally weighted portfolios (details for this are 
excluded from the table). There, in virtually all of the considered settings, the gaps are not statistically 
different from zero; and the number of cases that are (14 out of 200) is in line with the chances for a 
Type I error (false positives). So it is safe to conclude that VaR as an objective function does have its flaws. 
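
The check for non-optimized portfolios can be sketched in R as follows. The sketch is not the experiment reported in the table: the function name exceedance, the return parameters, and the assumption of uncorrelated assets are all illustrative choices.

```r
## Sketch (assumed setting): equally weighted portfolio, i.i.d. normal
## H-day log-returns with uncorrelated assets, and independent
## training and validation samples
exceedance <- function(S, N, alpha, mu = 0.0025, sigma = 0.03) {
    w <- rep(1 / N, N)
    train <- exp(matrix(rnorm(S * N, mu, sigma), S, N)) %*% w
    valid <- exp(matrix(rnorm(S * N, mu, sigma), S, N)) %*% w
    threshold <- quantile(train, alpha, names = FALSE)  # in-sample VaR
    mean(valid < threshold) - alpha                     # out-of-sample gap
}

set.seed(123)
gaps <- replicate(200, exceedance(S = 1000, N = 40, alpha = 0.05))
mean(gaps)
t.test(gaps)$p.value   # does the average gap differ from zero?
```

With fixed weights, the average gap should be close to zero; it is the optimization of the weights that opens the gap.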


16. Section 14.2.4 investigates this in more detail for mean-variance-optimized portfolios. 




TABLE 14.1 Average exceedance of out-of-sample shortfall probabilities compared to the 
target α (in-sample shortfall probability). First column is the number of assets, while values in 
italics are those that do not differ significantly from 0 (5% confidence level). 

          VaR empirical distribution, α = ...       VaR normal distribution, α = ... 
   N      0.01      0.02      0.05      0.1         0.01      0.02      0.05      0.1 

geometric Brownian motion (returns follow normal distribution) 
sample size S = 500 
   5      1.13%     1.49%     1.98%     2.68%       0.14%     0.19%     0.24%     0.20% 
  10      1.92%     2.44%     3.37%     4.45%       0.18%     0.19%     0.41%     0.67% 
  20      2.88%     3.48%     4.80%     6.19%       0.35%     0.46%     0.87%     1.10% 
  40      3.61%     4.45%     5.93%     7.46%       0.47%     0.66%     1.38%     2.06% 
sample size S = 1000 
   5      0.67%     0.92%     1.23%     1.56%       0.03%     0.07%     0.09%     0.09% 
  10      1.29%     1.61%     2.12%     2.94%       0.06%     0.17%     0.17%     0.33% 
  20      1.92%     2.33%     3.22%     4.08%       0.19%     0.26%     0.38%     0.64% 
  40      2.34%     2.86%     3.84%     4.95%       0.25%     0.42%     0.74%     1.14% 

geometric Brownian motion (returns follow t-distribution) 
sample size S = 500 
   5      1.22%     1.40%     1.79%     2.59%       0.18%     0.17%     0.16%     0.20% 
  10      1.92%     2.51%     3.29%     4.19%       0.26%     0.37%     0.45%     0.71% 
  20   [illegible]  3.42%     4.60%     6.08%       0.35%     0.41%     0.75%  [illegible] 
  40      3.48%     4.28%     5.49%     7.00%       0.48%     0.81%     1.16%     1.68% 
sample size S = 1000 
   5      0.71%     0.87%     1.17%     1.61%       0.10%     0.03%     0.13%    -0.03% 
  10      1.32%     1.61%     2.28%     2.83%       0.11%     0.11%     0.23%     0.28% 
  20      1.77%     2.31%     3.17%     3.93%       0.14%     0.22%     0.37%     0.65% 
  40      2.35%     2.85%     3.74%     4.88%       0.31%     0.44%     0.72%     1.05% 

empirical data (returns block-bootstrapped from S&P 500 stock returns) 
sample size S = 500 
   5      0.69%     0.90%     1.44%     2.07%       0.13%     0.17%     0.34%     0.48% 
  10      1.05%     1.52%     2.27%     3.09%       0.20%     0.26%     0.44%     0.55% 
  20      1.48%     2.07%     3.04%     4.23%       0.27%     0.36%     0.59%     0.89% 
  40      1.92%     2.60%     3.68%     5.17%       0.20%     0.42%     0.67%     1.00% 
sample size S = 1000 
   5      0.51%     0.67%     0.92%     1.35%       0.12%     0.14%     0.11%     0.17% 
  10      0.68%     0.94%     1.45%     2.05%       0.13%     0.20%     0.40%     0.31% 
  20      0.80%     1.21%     1.78%     2.48%       0.11%     0.17%     0.30%     0.54% 
  40      1.05%     1.43%     2.15%     2.99%       0.12%     0.20%     0.27%     0.51% 


There are some remedies. The first and most obvious one is not to use VaR in optimization, 
at least not with small samples and when computed as the empirical quantile. Using parametric 
distributions helps (even if they are not perfect for the data at hand and as long as one acknowledges 
the differences between parametric and actual thresholds). Increasing the sample size also helps, 
but is easier said than done. In practice, the number of available observations is often limited, in 
particular ones that can be considered representative of the investment horizon. And what counts as 
a large sample depends on how many assets there are and how small α is. Also, one can look at 
alternatives to VaR. A close relative is Expected Shortfall, which looks at the distribution beyond the 
threshold. By and large, it is (slightly) preferable to VaR, but still does allow for some data fitting. 
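
For illustration, an empirical Expected Shortfall can be computed next to the empirical VaR threshold as in the following R sketch. The data are made up, and measuring ES in terms of end-of-horizon values (the mean of all values at or below the threshold) is an assumption of this example.

```r
## Sketch with made-up end-of-horizon values: the empirical VaR
## threshold and the mean of all values at or below it
set.seed(7)
PF_end <- exp(rnorm(1000, mean = 0.002, sd = 0.02))
alpha <- 0.05
VaR_emp <- quantile(PF_end, alpha, names = FALSE)
ES_emp  <- mean(PF_end[PF_end <= VaR_emp])
c(VaR = VaR_emp, ES = ES_emp)
```

Because ES averages over all observations in the tail, a single observation moved across the threshold changes it less abruptly than it changes the quantile.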
Appendix 14.A Computing returns 


Suppose we have a sample of prices of na assets, with ns + 1 observations. We collect them in a 
matrix P such that the assets are in the columns; each row holds the prices for a given scenario, for 
instance, historical observations. Keep in mind that if we use these historical moments as inputs to 
our model, we make—implicitly or explicitly—the assumption that the underlying processes are 
stable enough that we can estimate these parameters. MATLAB and R code: 


Listing 14.7: C-PortfolioOptimization/M//Ch13/returns.m 


%% Time-stamp: <2018-05-10> 
%% (also tested with Octave 4.2.1) 

%% generate artificial price data: 
%% R = returns, P = prices 
ns = 100;  % number of scenarios 
na = 10;   % number of assets 
R = 1 + randn(ns, na) * 0.01; 
P = cumprod( [100 * ones(1, na); R] ); 
plot(P); 

%% discrete returns 
% compute returns: rets should be equal to R 
rets1 = P(2:end,:) ./ P(1:(end-1),:); 
% ... or 
rets2 = diff(P) ./ P(1:(end-1),:) + 1; 
max(max(abs(rets1-R)))      % not ‘exactly’ equal 
max(max(abs(rets2-R)))      % not ‘exactly’ equal 
max(max(abs(rets1-rets1)))  % ‘exactly’ equal 

%% log-returns 
rets3 = diff(log(P)); 
% ... almost like discrete returns 
plot(rets1(:) - rets3(:) - 1) 


> ## generate artificial price data 
> ## R = returns, P = prices 
> ns <- 100  ## number of scenarios 
> na <- 10   ## number of assets 
> R <- 1 + array(rnorm(ns * na, sd = 0.01), 
                 dim = c(ns, na)) 
> P <- rbind(100, R) 
> P <- apply(P, 2, cumprod) 
> matplot(P, type = "l")  ## not shown 
> ## discrete returns 
> ## compute returns: rets should be equal to R 
> rets1 <- P[-1, ] / P[-nrow(P), ] 
> ## ... or 
> rets2 <- diff(P) / P[-nrow(P), ] + 1 
> max(abs(rets1-R))      ## not exactly equal 
[1] 2.22e-16 
> max(abs(rets2-R))      ## not exactly equal 
[1] 2.22e-16 
> max(abs(rets1-rets1))  ## exactly equal 
[1] 0 
> ## log-returns 
> rets3 <- diff(log(P)) 
> ## ... almost like discrete returns 
> plot(c(rets1) - c(rets3) - 1, cex = 0.5)  ## not shown 


There is also a large number of R packages that allow you to compute returns. The PMwR 
package, which we will look at in Chapter 15, provides one such function. 


> library("PMwR") 
> returns(1:5) 
[1] 1.000 0.500 0.333 0.250 
> returns(cbind(1:5, 1:5)) 
      [,1]  [,2] 
[1,] 1.000 1.000 
[2,] 0.500 0.500 
[3,] 0.333 0.333 
[4,] 0.250 0.250 

The variance—covariance matrix can be computed from these returns. For the textbook method, use 
the function named cov in both MATLAB and R. 
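
A minimal R sketch of this step, using artificial prices generated as above (the dimensions and the volatility of 0.01 are assumptions of the example):

```r
## Sketch: sample variance-covariance matrix from artificial prices
set.seed(1)
ns <- 100
na <- 3
R <- 1 + array(rnorm(ns * na, sd = 0.01), dim = c(ns, na))
P <- apply(rbind(100, R), 2, cumprod)
rets <- P[-1, ] / P[-nrow(P), ] - 1   # discrete returns
Sigma <- cov(rets)                    # na x na matrix
Sigma
sqrt(diag(Sigma))                     # volatilities, close to 0.01
```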


Appendix 14.B More implementation issues in R 


In this appendix we discuss several points about how to implement the objective function. The 
discussion is general and thus relevant for all optimization functions in NMOF (DEopt, LSopt, 
PSopt, TAopt, SAopt). 


14.B.1 Scoping rules in R and objective functions 




Scoping rules detail how a symbol, such as “x”, is associated with a value. R’s scoping rules are 
described in Gentleman and Ihaka (2000). 

The intuitive idea of a function is that of a piece of code that takes input, and returns an output. 
Concretely, an objective function takes the parameters of the model, and also data and perhaps other 
pieces of information; it then returns a real number that characterizes the quality of our solution. In 
R we can create a function in such a way that it already includes the data and only takes the model’s 
parameters as inputs. What is special about R (compared with MATLAB) is that such a function 
could be created as the output of another function. An example: we create random data for a linear 
regression; the matrix X has nC columns, and nR rows. 


> nR <- 6 
> nC <- 3 
> X <- array(rnorm(nR * nC), dim = c(nR, nC)) 
> y <- rnorm(nR) 
> X 
        [,1]   [,2]   [,3] 
[1,] -0.5064 -0.537 -0.697 
[2,]  1.1439  0.902  1.634 
[3,] -1.4853  0.487 -0.422 
[4,]  1.2528  0.243  0.109 
[5,] -1.1776  0.385 -0.668 
[6,] -0.0086 -0.916  1.518 


Say we want to fit Xβ = y not with Least Squares; instead, we wish to 
minimize the maximum absolute residual. An objective function would need a vector param (the 
coefficients), the matrix X, and the vector y. 


> OF1 <- function(param, X, y) 
      max(abs(y - X %*% param)) 


The function createF takes our data (X and y), and then it defines a new function. 

> createF <- function(X, y) { 
      function(param) 
          max(abs(y - X %*% param)) 
  } 
> OF2 <- createF(X, y) 


OF2 takes only param as its argument, but not X and y. However, OF2 can still access these 
objects, because they were present in the environment in which OF2 was defined. 


> OF1 
function(param, X, y) 
    max(abs(y - X %*% param)) 
> OF2 
function(param) 
    max(abs(y - X %*% param)) 
<environment: 0x55816ea913d8> 

Both OF1 and OF2 should give the same results. 

> param <- rnorm(nC) 
> OF1(param, X, y) 
[1] 2.15 
> OF2(param) 
[1] 2.15 


This could be a “trap.” When R evaluates a function call and finds that an object does not exist 
in the current environment, it will search the enclosing environments. So let us remove X and y 
from the global environment. 


> remove(list = c("X", "y")) 
> try(OF1(param, X, y)) 
> OF2(param) 
[1] 2.15 


We may still access X and y. 


> ls(envir = environment(OF2)) 
[1] "X" "y" 
> get("X", envir = environment(OF2), inherits = FALSE) 
        [,1]   [,2]   [,3] 
[1,] -0.5064 -0.537 -0.697 
[2,]  1.1439  0.902  1.634 
[3,] -1.4853  0.487 -0.422 
[4,]  1.2528  0.243  0.109 
[5,] -1.1776  0.385 -0.668 
[6,] -0.0086 -0.916  1.518 


We suggest being careful with R’s scoping rules. If possible, pass all required objects directly 
into the function, or explicitly create closures, as we did here. 
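
The following sketch illustrates why relying on objects in the global environment can be a trap. The functions f_global, make_f, and f_closure are hypothetical names used only for this illustration.

```r
## Sketch: a function that silently relies on a global object,
## compared with an explicitly created closure
X <- c(1, 2, 3)
f_global <- function(param) sum(X * param)   # looks up X when called
make_f <- function(X) {
    force(X)                                 # fix X at creation time
    function(param) sum(X * param)
}
f_closure <- make_f(X)

X <- c(10, 20, 30)   # the global X changes ...
f_global(1)          # ... and so does the result: 60
f_closure(1)         # the closure still uses the original data: 6
```

The closure keeps its own copy of the data, so later changes to (or removal of) the global X do not affect it.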


14.B.2 Vectorized objective functions 


Local Search and Threshold Accepting are trajectory methods; they develop a single solution at a 
time. The basic structure of the objective function is clear: pass a solution, and receive a number. 
When we use population-based heuristics, we need to decide how we implement the objective 
function: we can either write the function for a single solution (i.e., one column of the population 
matrix), or pass the whole population into the function and only then evaluate the single solutions. 

A function for a single solution is more user-friendly and allows for easier reuse of code; 
internally, the optimization procedure could then call a function of the apply family and evaluate the 
objective function for the columns of the population matrix. But if we go for speed (and we often 
should with heuristic methods), we had better pass the whole population directly into the objective 
function, which then returns a vector of numbers. This makes it slightly more complicated for 
users to write alternative objective functions, since now they have to handle the evaluation of the 
whole population themselves. But the advantage is that in some cases truly vectorized evaluation 
is possible. (apply functions are still loops; see Chambers, 2008, Chapter 6.) In our particular 
application, this includes the question when and how to compute the product Rw (or Pw). 

An example can illustrate this. Suppose we wish to minimize a ratio of partial moments of order 
one, the so-called Omega ratio (Keating and Shadwick, 2002). For a given sample of portfolio 
returns r^p = Rw, this can be computed as 


$$ \frac{\sum \max(\theta - r^p, 0)}{\sum \max(r^p - \theta, 0)} $$ 


where θ is a threshold, the watershed between losses and gains. An R function could look like this: 


> omega <- function(r, theta) { 
      rr <- r - theta 
      -sum(rr - abs(rr)) / sum(rr + abs(rr)) 
  } 


We have used the fact that for some vector rr, computing pmax(rr, 0) gives the same result 
as (rr + abs(rr))/2. This is generally faster than using a logical expression such as 


> -sum(rr[rr < 0]) / sum(rr[rr > 0]) 

or directly using pmax and pmin. 


> library("rbenchmark") 
> x <- rnorm(1000) 
> all.equal(pmax(x, 0), (x + abs(x))/2) 
[1] TRUE 
> benchmark(pmax(x, 0), x + abs(x), 
            replications = 20000, 
            order = "relative")[, 1:4] 
        test replications elapsed relative 
2 x + abs(x)        20000   0.050     1.00 
1 pmax(x, 0)        20000   0.199     3.98 
> all.equal(pmin(x, 0), (x - abs(x))/2) 
[1] TRUE 
> benchmark(pmin(x, 0), x - abs(x), 
            replications = 20000, 
            order = "relative")[, 1:4] 
        test replications elapsed relative 
2 x - abs(x)        20000   0.050     1.00 
1 pmin(x, 0)        20000 [illegible] 

It is then straightforward to evaluate a single portfolio. We create an artificial data set R, a 
random portfolio w, and compute its Omega ratio. 


> ## create artificial data 
> ## ...ns = number of scenarios 
> ## ...na = number of assets 
> ns <- 200 
> na <- 100 
> R <- array(rnorm(ns*na)*0.05, dim = c(ns, na)) 
> ## set up a random portfolio 
> w <- runif(na) 
> w <- w / sum(w) 
> ## compute returns 
> rp <- R %*% w 
> ## compute omega 
> omega(rp, theta = 0.001) 


Now suppose we have a whole matrix P of solutions. We can compute Omega for this whole matrix 
as follows. 


> ## objective function, alternative 
> omega2 <- function(r, theta) { 
      rr <- r - theta 
      omega2 <- -colSums(rr - abs(rr)) / colSums(rr + abs(rr)) 
      omega2 
  } 
> ## check: compute omega 
> omega2(rp, theta = 0.001) 


We can test several evaluation strategies. 


> ## set up a random population 
> ## ...nP = population size 
> nP <- 200 
> P <- array(runif(na*nP), dim = c(na, nP)) 
> P <- P / outer(numeric(na)+1, colSums(P))  # budget constraint 
> ## evaluate population 
> rp <- R %*% P 
> loop_over_cols <- function() { 
      ans <- numeric(nP) 
      for (i in seq_len(nP)) 
          ans[i] <- omega(rp[, i], theta = 0.001) 
      ans 
  } 
> benchmark(loop_over_cols(), 
            apply(rp, 2, omega, theta = 0.001), 
            omega2(rp, theta = 0.001), 
            order = "relative", 
            replications = 1000)[, c(1, 3, 4)] 
                                test elapsed relative 
3          omega2(rp, theta = 0.001)   0.290     1.00 
1                   loop_over_cols()   0.734     2.53 
2 apply(rp, 2, omega, theta = 0.001)   0.837     2.89 
> rp <- R %*% P 
> a1 <- loop_over_cols() 
> a2 <- apply(rp, 2, omega, theta = 0.001) 
> a3 <- omega2(rp, theta = 0.001) 
> all.equal(a1, a2) 
[1] TRUE 
> all.equal(a1, a3) 
[1] TRUE 


Each time, we evaluate the columns of the matrix P. (We repeat this 1000 times to get a more 
reliable time measurement.) The first variant is the most straightforward: we loop over the columns, 
and pass each solution into the objective function. In the second variant we use apply. However, 
the most efficient implementation is the third. In all timing tests, we excluded the matrix product 
RP. This can be justified, since it needs to be computed for all variants. 


Appendix 14.C A neighborhood for switching elements 


> N_switch <- function(x, Data) { 
      out <- !x 
      in.i <- which(x) 
      out.i <- which(out) 
      if (!length(out.i)) 
          return(x) 
      i <- in.i[sample.int(length(in.i), size = 1)] 
      j <- out.i[sample.int(length(out.i), size = 1)] 
      x[i] <- FALSE 
      x[j] <- TRUE 
      x 
  } 

A little helper function, borrowed from the NMOF manual. 


> compareLogicals <- function(x, y, ...) { 
      argsL <- list(...) 
      if (!("sep" %in% names(argsL))) 
          argsL$sep <- "" 
      do.call("cat", 
              c(list("\n", as.integer(x), 
                     "\n", as.integer(y), 
                     "\n", ifelse(x == y, " ", "^"), "\n"), 
                argsL)) 
      message("The vectors differ in ", 
              sum(x != y), 
              " place(s).") 
      invisible(sum(x != y)) 
  } 


Let us try the functions. 


> x <- logical(10) 
> x[c(1, 3)] <- TRUE 
> x 
 [1]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
> N_switch(x) 
> compareLogicals(x, N_switch(x)) 

1010000000 
1000001000 
  ^   ^ 

> compareLogicals(x, N_switch(x)) 

1010000000 
0010000010 
^       ^ 

> compareLogicals(x, N_switch(x)) 

1010000000 
0010000100 
^      ^ 

15.1 What is (the problem with) backtesting? 


Backtesting means simulating an investment strategy on past data. Most of the time, the word is 
used in the context of systematic or rule-based investment strategies. Such trading had often been 
called mechanical in the 1960s, and today it goes by the name of automated, algorithmic or quant 
trading. 

Simulating historical performance is, in principle, not limited to systematic strategies: if a 
discretionary trader has kept a diary of his trades, we might as well simulate his trading history. Or 
we may test the predictions that a newspaper has made in the past, and simulate the financial 
outcomes of its investment advice. There is, however, a difference between tracking a newspaper’s 
recommendations and running a systematic strategy on past data: the first was live, though its 
performance has not been measured; the second was not. This chapter is about the second, non-live 
type. The software we describe may, however, be used for the first type as well. 

Clearly, the types of trading that we mentioned encompass a wide range of strategies, from 
algorithms that operate at ultra-high frequency at the order book level, to fundamentals-based 
investment models that are rebalanced once per year. There is no way that we could treat them all in 
a single chapter (not to mention that we could not plausibly be knowledgeable enough in all those 
different types of trading); we shall thus limit ourselves to several examples, which will also give 
away the area in which the authors of this chapter have been working. 

So here is the plan for the chapter. Throughout the first part, we will try to make the point 
that backtesting is difficult and perilous: preparing the inputs, running the backtest, and finally 
interpreting the results all come with pitfalls and problems. But the purpose of the chapter is not to 
deter you from running backtests: backtesting is a necessary step to gain insight into a strategy, and 
it is an indispensable part of empirical work. Thus, the goal is not to deter, but to raise awareness 
of some of the pitfalls, and to suggest possible remedies. In the second part of the chapter, we will 
describe software for running backtests,¹ and illustrate the software through several examples from 
US equity markets. 

But as we said, we start with the problems. 


1. The description builds on Schumann (2008–2018c). 
FIGURE 15.1 Daily returns of the S&P 500 between January 1970 and December 2017 (11,976 days). On October 19, 
1987 the index dropped by more than 20%. (Annotation in the figure: Oct 19, 1987: -20.5%.) 

FIGURE 15.2 S&P 500 between January 1970 and December 2017 without 10 best and 10 worst days. Having missed the 
best 10 days in the market would have led to a cumulative underperformance of more than 50%, or 0.9% p.a., against the 
market. That is, an investor in the S&P would have ended with a wealth more than 50% higher. Avoiding the 10 worst days 
would have led to an outperformance of more than 80%, or 1.3% p.a., against the market. 


15.1.1 The ugly: intentional overfitting 


In many areas of the financial world, the word backtest has a very bad reputation. The reason is 
simple: run enough backtests, and you are bound to find strategies that look good in-sample, i.e. 
on a particular data set. Even worse, the number of backtests you have to run for good results is 
surprisingly small. 

Financial models are what numerical analysts call sensitive: small changes to the inputs of 
computations often lead to much larger changes in the results. A single day, for instance, can have 
a large impact even over a long time period: just think what a difference it makes to include 
October 19, 1987 in a study on equity market returns or to exclude it; see Figs. 15.1 and 15.2. 

This sensitivity implies that changes, even minuscule ones, to the configuration of a backtest 
often lead to meaningful changes in the results. Some such changes will, by mere chance, be 
positive. It is thus very likely that in the neighborhood of essentially any strategy, one may find even 
better configurations, making it tempting and easy to “improve” a strategy by changing parameters. 
There typically is no real improvement, of course; only improved statistics for a particular dataset. 
Nevertheless, it opens the door for fiddling with a backtest until it looks good. At the same time, it 
becomes more difficult to evaluate the robustness of strategies. 
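
To get a feeling for this sensitivity, consider the following R sketch. It uses simulated daily returns, not S&P 500 data; the mean, volatility, and number of days are assumptions chosen for illustration.

```r
## Sketch with simulated daily returns (not S&P 500 data): final
## wealth with and without the 10 best and the 10 worst days
set.seed(99)
r <- rnorm(12000, mean = 0.0003, sd = 0.01)
wealth <- function(r) prod(1 + r)
ord <- order(r)                              # worst days come first
all_days      <- wealth(r)
without_best  <- wealth(r[-tail(ord, 10)])
without_worst <- wealth(r[-head(ord, 10)])
c(all = all_days, no_best = without_best, no_worst = without_worst)
```

Even though only 20 of 12,000 days are affected, final wealth changes noticeably, because returns are chained geometrically.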

Let us run an experiment: we run a strategy on random data. For this experiment we make use of 
packages NMOF and PMwR (Schumann, 2018b), which we first load and attach. We also set a seed 
to make the results reproducible. 


> library("NMOF") 
> library("PMwR") 
> set.seed(2552551) 



2. See Example 1.2 on page 12. Extreme events may be rare, but their impact is large, even when downweighted by their 
frequency. What is more, if such events affect returns, the impact will propagate because of the geometric chaining of 
returns. 
We define a function randomPriceSeries, which creates a simple random walk by chaining 
together Gaussian returns with zero mean. The created series always has an initial 
value of 100. For intuition, think of daily observations of a stock price. (See Chapter 6 for how to 
create such series.) 


> randomPriceSeries <- 
      function(length, vol = 0.01, demean = FALSE) { 
      x <- cumprod(1 + rnorm(length - 1, sd = vol)) 
      scale1(c(1, x), centre = demean, level = 100) 
  } 


The function allows us to define a return volatility (argument vol), with default 1%, which would 
translate into an annual volatility of about 16% (0.01 × √250 ≈ 0.158, assuming 250 trading days 
per year). We create and plot a sample path, shown in Fig. 15.3. 


> x <- randomPriceSeries(250) 
> plot(x, type = "s", xlab = "", ylab = "x") 


Actually, if you run the code as it is shown here, you will find that your plot looks slightly different 
from the one printed in Fig. 15.3. That is because before calling plot, we called par with several 
settings, collected in a list par_btest. 


> par_btest 

$bty 


$mgp 
[1] 2.0 0.5 0.0 


$tck 
[1] 0.01 


$ps 
[1] 9 


Before every plot in this chapter, we invoke 
> do.call(par, par_btest) 


(or a slight variation of it). In this way, we may reuse the settings for other graphics in this chapter. 
We typically omit the line in the shown code; you can, however, find it in the chapter’s R file. Also, 
in this chapter we often plot different data in essentially the same way; so we will typically show 
the code the first time we plot, and then omit it later. Again, the complete code is in the chapter’s R 
file. 

Let us test a simple technical trading system on this random series. We compute two moving 
averages, m10 and m30. The subscripts stand for the order of the averages (again, think of trading 
days). Whenever m10 > m30 we invest our total wealth in the asset; otherwise we do not hold a 
position and keep cash instead. The function MA_crossover implements the strategy and simulates 
its results on a series S. 


FIGURE 15.3 A random series created with function randomPriceSeries. 


3. The function also has an argument demean: if set to true, the series will be scaled so that its total return is zero. 


> m <- c(10, 30) 
> MA_crossover <- function(m, S) { 
      m.fast <- MA(S, m[1], pad = NA) 
      m.slow <- MA(S, m[2], pad = NA) 

      crossover <- function() { 
          if (m.fast[Time()] > m.slow[Time()]) 
              1 
          else 
              0 
      } 
      tail(btest(S, signal = crossover, 
                 b = 60, initial.cash = 100, 
                 convert.weights = TRUE)$wealth, 
           n = 1) 
  } 


MA_crossover first computes two moving averages. It then defines a new function, crossover, 
and passes it to function btest. We shall explain later in this chapter what exactly the code does. 
It should suffice for now that it returns the final wealth of an investment that starts with an initial 
wealth of 100. Let us try the function with our random series. 


> MA_crossover(m, x) 


[1] 92.29
Let us try another series. 


> x <- randomPriceSeries (250) 
> MA_crossover(m, x) 


KET 19:3:.5 


At first sight it should not come as a surprise that the strategy did not work well: after all, the asset’s 
price is random, and we know there is no exploitable pattern. (We did not deduct transaction costs.) 
But let us run the experiment 100 times, and see what happens. 


> m <- c(10, 30)

> initial_wealth <- 100

> buyhold_profit <- final_profit <- numeric(100)
> for (i in seq_along(final_profit)) {
      S <- randomPriceSeries(250)
      final_profit[i] <- MA_crossover(m, S) - initial_wealth
      buyhold_profit[i] <- S[length(S)] - S[1]
  }


FIGURE 15.4 Distribution of difference in final profits MA-crossover strategy vs buy-and-hold. A positive number means
that MA-crossover outperformed the underlying asset. The distribution is symmetric about zero, which indicates that there
is no systematic advantage to the crossover strategy.

We store the final profit of all runs in a vector final_profit. The profits of simply holding 
the underlying random series are stored in buyhold_profit. To control for a rising underlier, 
and to see how the strategy fared, we summarize the differences in final profits, i.e. strategy profit 
minus buy-and-hold-profit. (Alternatively, we might have set demean to TRUE when we created 
the series.) 


> summary(final_profit - buyhold_profit)


Min. 1st Qu. Median Mean 3rd Qu. Max. 
-29.3 -8.3 0.9 0.6 7.8 32.4 


The following code block computes the distribution of the profit differences and plots it; see
Fig. 15.4. Roughly half of the strategies outperformed the asset.


> plot(ecdf(final_profit - buyhold_profit),
       main = "",
       pch = NA,
       verticals = TRUE,
       xlab = paste("Final-profit difference",
                    "(crossover minus buy-and-hold)"))

> abline(v = 0,
         h = ecdf(final_profit - buyhold_profit)(0))


Fig. 15.5 shows a scatter plot of the buy-and-hold profits and the strategy profits. We use function 
eqscplot from package MASS, which automatically chooses equal scales on both axes. 


> library("MASS")

> do.call(par, par_btest)

> eqscplot(buyhold_profit, final_profit,
           main = "", pch = 19, cex = 0.6)

> abline(v = 0, h = 0)


In sum, running the crossover strategy on random data had about a 50/50 chance of outperforming
the asset on which it was based.⁴ When you reflect on the result, it may appear obvious: there


4. In real life, the chances may be lower because of transaction costs and because of the properties of a strategy: for instance, 
a strategy that is mostly short in a rising market will mostly lose out. 


FIGURE 15.5 Final profits of MA-crossover strategy vs result of buy-and-hold of the same underlying series.


is no systematic bias in the random underlying time-series, and hence the crossover strategy could 
not pick up any advantage. But the purpose of the experiment was not to compute the average out- 
come. Instead, we wanted to show the range of possible outcomes: many realizations did poorly; 
but others did very well, merely because of chance. Imagine yourself in a world with tens of thou- 
sands of quant traders (quant monkeys) who test different strategies on different assets. Even after 
transaction costs and other hurdles, a large number of apparently successful strategies will remain. 
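
The selection effect among many traders can be made concrete with a small simulation. The following is a minimal sketch and not part of the chapter's R file: it assumes 10,000 hypothetical traders whose daily strategy returns are pure noise (mean zero, daily volatility of 1%, 250 trading days), and it counts how many of them nevertheless show an annualized Sharpe ratio above 1. All names and parameter values are illustrative assumptions.

```r
## Sketch: zero-skill strategies that look good by chance.
## Assumptions (not from the text): 10,000 traders, 250 days,
## i.i.d. Gaussian daily returns with mean 0 and volatility 1%.
set.seed(42)
n_traders <- 10000
n_days <- 250

## one column per trader; every strategy has zero expected return
R <- matrix(rnorm(n_days * n_traders, mean = 0, sd = 0.01),
            nrow = n_days, ncol = n_traders)

## annualized Sharpe ratio of each strategy
sharpe <- sqrt(250) * colMeans(R) / apply(R, 2, sd)

## share and number of traders with a Sharpe ratio above 1
lucky <- sharpe > 1
mean(lucky)
sum(lucky)
```

Under these assumptions, the one-year Sharpe ratio of a strategy without any edge is roughly standard normal, so about one trader in six should clear a Sharpe ratio of 1 by luck alone.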

Let us run a second experiment. Before, we used 10 and 30 as the parameters for the strategy. Let 
us see what happens when we choose the parameters in a more “optimal” way. We fix admissible 
ranges for the parameters: 1 to 20 for the fast moving average, and 21 to 60 for the slow one. Since
we have only two parameters to check, we may run a backtest for each combination and keep the 
best backtest. One possibility to run such an optimization is through a nested loop: one loop for 
each parameter. The function MA_crossover_optimized will do that for us. 


> MA_crossover_optimized <- function(S) {
      fast <- 1:20
      slow <- 21:60

      crossover <- function() {
          if (m.fast[Time()] > m.slow[Time()])
              1
          else
              0
      }
      best <- -10000
      best.par <- c(0, 0)
      for (f in fast) {
          m.fast <- MA(S, f, pad = NA)
          for (s in slow) {
              m.slow <- MA(S, s, pad = NA)
              res <- btest(S, crossover, b = 60,
                           initial.cash = 100,
                           convert.weights = TRUE)
              if (tail(res$wealth, 1) > best) {
                  best <- tail(res$wealth, 1)
                  best.par <- c(f, s)
                  best.wealth <- res$wealth
              }
          }
      }
      attr(best, "wealth") <- best.wealth
      attr(best, "parameters") <- best.par
      best
  }


FIGURE 15.6 Optimized backtests. The black lines show the asset; the gray lines show the performance of the MA-crossover
strategy with optimal parameters.


Just like MA_crossover before, it returns the final wealth for the given dataset; it also returns (as
attributes) the equity curve of the strategy and the best parameters, which lead to that equity curve.


> x <- randomPriceSeries (1000) 
> res <- MA_crossover_optimized (x) 


The NMOF package offers a more convenient function for such a computation: gridSearch. 
gridSearch will minimize, so we need to flip the sign of the function we pass in. We can do this 
by wrapping MA_crossover into an anonymous function. 


> res2 <- gridSearch(function(m, S)
                         -MA_crossover(m, S),
                     S = x,
                     levels = list(1:20, 21:60))


Fig. 15.6 shows three examples of random paths and the equity curves of the associated optimized 
strategies. The graphics provide an indication of the typical results we get. But let us be a little 
more systematic and run the experiment 100 times (i.e. on 100 random paths). The variable
initial_wealth remains 100, as defined before.


> buyhold_profit_opt <- final_profit_opt <- numeric(100)
> for (i in seq_along(final_profit_opt)) {
      x <- randomPriceSeries(1000)
      res_gs <- gridSearch(function(m, S) -MA_crossover(m, S),
                           S = x,
                           levels = list(1:20, 21:60))
      final_profit_opt[i] <- -res_gs$minfun - initial_wealth
  }


FIGURE 15.7 Distribution of in-sample excess profits for optimized MA crossover strategy. A positive number means that
MA crossover outperformed the underlying asset.


Again, we summarize the results; a plot of the distribution of the differences in final profits is 
shown in Fig. 15.7. 


> summary (final_profit_opt) 


Min. 1st Qu. Median Mean 3rd Qu. Max. 
-20.1 3i 19.1 19.8 31.9 89.5 


Compare Fig. 15.7 with Fig. 15.4: only very few of the optimized strategies now perform worse than
the asset. This fact (that even simple models with only few parameters may be fitted so that they
perform well in-sample, even when in truth there was no advantage at all) has given rise to the bad
reputation of backtesting, because it has been exploited to trick investors into supposedly
brilliantly performing strategies that in truth were nothing more than backfitting exercises.

Such intentional overfitting is not the purpose of the book; but we hope the experiment helps 
raise awareness. Despite this dark potential, backtesting is a highly useful technique. The remaining 
part of the chapter offers advice on how to use it. 


15.1.2 The bad: unintentional overfitting and other difficulties 


At the risk of repeating ourselves: Backtesting is difficult. There are many problems besides
overfitting.

• In finance we have little data to learn from, at least when compared with the vast amounts
of data that are available in other areas, e.g. the billions of photos that Google has at its disposal to
train its face recognition methods, say, all neatly tagged so that algorithms may learn to recognize
cats or whatever. These kinds of data are simply not available in finance. It is true that once
we move into high-frequency data, the size of datasets grows quickly. But this only “zooms
into” existing data: having tick data for a year in which the average stock rose may mean many
observations, but these are not independent of one another. The situation gets worse when we
work with lower-frequency data types, such as fundamental or macroeconomic data.

• We require accurate historical data. For instance, we need to know what assets could be bought
and sold at what prices; and we need to know how these assets performed during the past. For
regulated markets in developed countries in recent years, such data are typically available.

• When we simulate a trading strategy, we may need to account for its influence on the market.
On a high level, this may mean that we need to make assumptions on price impact. Or think of
an extreme case: you want to test an algorithm that acts on patterns in the order book. It is much
more difficult to test such an algorithm on historical data, because if the algorithm acts on what
it sees, it will issue orders, and hence you start to change the historical track-record of the order
book.


FIGURE 15.8 Performance of EURO STOXX Total Market and EURO STOXX Banks indices between January 2006
and July 2018. The weak performance of European banks is apparent. Many different models, such as momentum or low
volatility, might have captured this trend.


Once we have run backtests, we need to analyze their results. Suppose we ran tests that actually 
showed that the strategy we are interested in was profitable. Then there are still several points to 
keep in mind. 

Any test depends on time and assets: clearly, in an equity bull market it should not be surprising 
that a strategy with a long bias performs well. It helps to use a realistic benchmark; but it is often 
hard to find such a benchmark. This is particularly a problem when we work with a strategy that 
acts on a large cross-section of assets. Suppose we have run a cross-sectional equity strategy and 
find that it outperformed the benchmark. It may only have been because it picked up certain factor 
exposures: the classic market factor, or Fama–French factors (value, small cap) are candidates to
test. But the burden of coming up with relevant benchmarks or risk factors lies with the analyst.
For single-instrument strategies, similar problems arise, notably related to exposure: it becomes 
much more difficult to evaluate a strategy that is often not invested, or changes leverage frequently. 

As we pointed out above, financial models are sensitive: small changes may lead to very large 
differences in performance. A single trade done or not done may make much of a difference. But 
that makes it difficult to judge whether a strategy’s profits were real or just luck. (Some authors
have thus suggested measuring only cross-sectional improvements versus a benchmark, because
there are many more bets than with a general timing strategy. But for one, this requires independent
bets, and cross-sectional bets often are not independent; also, just because we cannot measure
a quantity well does not mean we should not try.)

This sensitivity also makes it difficult to compare different strategies. We may have an obser- 
vational equivalence between different strategies: different models may capture the same market 
trend. See Fig. 15.8 for an example. For a given sample, it is then impossible to differentiate be- 
tween these models (based on the data alone), and hence, it is difficult to rank models. 

That leaves us with unintentional overfitting. We may do it with the best of intentions, but: 
whenever we run many backtests on a single dataset, we are bound to find some that work, but 
which may quite likely be the result of data-snooping. What can be done? As long as we run tests 
on a single dataset, there is unfortunately little hard advice that can be given.⁵ But the situation is
not hopeless, and some informal rules do help: 


5. A number of papers have aimed to provide statistics that are adapted to multiple backtests, often following ideas from
the statistical literature on multiple comparisons. The key message of these papers is sound: after having run multiple 
backtests, one must not trust the results of a single backtest any more. Such papers are sometimes instructive, e.g. when 
they show how very few backtests are needed to obtain good results, or when it is recommended that performance statistics 
be discounted. But some of the suggested statistics are harmful. For one, they provide a sense of exactitude that is not 
warranted. In particular, they are usually based on sampling backtests on a given data set, so they do not cover the case 
of iteratively “optimizing” a backtest. But there is an even worse aspect. A simple rule for evaluating strategies is that if a 
backtest looks too good to be true, it probably is not true. But some suggestions in the literature (e.g. Harvey et al., 2016) do 
the contrary, actually: the authors’ solution is to insist on a higher level of statistical significance than normally used. But 
if we assume that useful strategies do not provide outlandishly-high Sharpe ratios, the proposed statistic would never flag a 
realistic strategy as a good strategy. On the other hand, a strategy that comes with a Sharpe ratio of 10, say, will likely pass 
the test, because of its high statistical significance (the Sharpe ratio is equivalent to a t-statistic).


• Try few things only: restrict yourself to few variants of a backtest, if at all. In any case, document
everything you try, and read and reread those notes. (But do not use formal statistics such as ratios
of good strategies to tested strategies or similar.) What is more, backtests should be motivated
by observations in the market, or theory, but not by other backtests (“this did not work, but what
if I changed this parameter ...”).

• Do not just analyze the performance of strategies, but see why they performed well or not in
particular times or markets. It may be easier to evaluate a strategy when you consider the cause
for its good performance, and in particular whether this cause will persist (see Fig. 15.8 again)
or not.

• Be careful with composite strategies whose single components do not work on their own.

• Test the sensitivity of strategies: the goal is not to find the best variant of one model, but a model
that works OK-ish in many different variants. See the examples below.

• Finally, or perhaps firstly: check your code. For instance, run it with random data to see whether
any part can “see the future”.
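
The last rule in the list can be illustrated with a short sketch. It assumes a hypothetical signal that is long whenever the latest return was positive, and it applies this signal to random returns twice: once correctly lagged, and once with an off-by-one error that lets the signal see the very return it trades on. Nothing here comes from the chapter's R file; the variable names and parameter values are made up for illustration.

```r
## Sketch: detecting lookahead with random data (base R only).
set.seed(1)
r <- rnorm(1000, mean = 0, sd = 0.01)   # random daily returns

## hypothetical signal: long (1) after a positive return, else flat (0)
signal <- as.numeric(r > 0)

## correct: the position held during day t uses the signal of day t-1
pos_ok <- c(0, signal[-length(signal)])
## buggy: the position held during day t uses the signal of day t itself
pos_bug <- signal

pl_ok <- sum(pos_ok * r)
pl_bug <- sum(pos_bug * r)
c(correct = pl_ok, lookahead = pl_bug)
```

On random data the correctly lagged version should earn roughly nothing, whereas the buggy version collects every positive return. A backtest that is clearly profitable on noise points to such an error.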


Beyond working on a single dataset, there are two useful techniques: replication and meta analysis.
Start with replication. Unfortunately, in the academic world, there is little incentive for it, because
replication studies may not be easy to publish (unless you show that some seminal paper cannot be
replicated). But replicating a backtest with new data, and perhaps also with a different implementa- 
tion, can provide evidence for and confidence in a strategy. Meta analysis is related to replication, 
but broader. It might be summarized as “do not trust a single paper, or a single methodology.” 
Rather, collect published papers (and ideally also unpublished ones) on a certain topic, and look 
at the range of outcomes. Again, in finance, this is made difficult because negative results (“the 
strategy did not perform well”) are rarely published. 

And there is more, once we move beyond pure backtesting. We can set up our software for analyzing
strategies so that data are updated as time progresses. That is a good idea in any case, because it
keeps the backtesting workflow close to production. And we may run real-time experiments: either
in the form of paper trading, or real money trading. We may then follow the statistics of a strategy.
This is similar to the clinical testing that is done in the pharmaceuticals industry. Of course, if we
were interested only in definitive answers, we would need decades of live data. But we will never
have definitive answers. Such experiments may nevertheless increase our understanding of and
confidence in a strategy. And one final remark on overfitting: grossly overfit strategies are usually
discovered quickly, as their performance breaks down quickly when used on new data.
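
A back-of-the-envelope computation shows why decades of live data would be needed for anything close to a definitive answer. The sketch assumes i.i.d. returns, in which case the t-statistic of the mean return is approximately the annualized Sharpe ratio times the square root of the number of years. The function name, the Sharpe ratios, and the threshold of 2 are illustrative choices, not values from the text.

```r
## Years of data needed to reach a t-statistic of t_target,
## given the annualized Sharpe ratio (assumes i.i.d. returns).
years_needed <- function(sharpe, t_target = 2)
    (t_target / sharpe)^2

sharpe <- c(0.3, 0.5, 1, 2)
data.frame(sharpe = sharpe, years = years_needed(sharpe))
```

Under these assumptions, a strategy with a Sharpe ratio of 0.5 needs 16 years of live data, and one with a Sharpe ratio of 0.3 needs more than 40 years.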

Altogether, there is a key insight to be taken from the discussion: in-sample results are not 
just dubious in financial backtests; they are typically worthless. This means that all backtesting 
must include some type of out-of-sample analysis. Of course, if you have followed the examples: 
even such out-of-sample tests, if done repetitively, are not going to save us. But they do help. 
Single splits of a dataset are typically not enough. Instead, for backtests the standard tool is a 
walk-forward. In a walk-forward, we split the dataset into many periods, each with one in-sample 
and one out-of-sample part. We may object at once: since we work on a historical dataset, there 
cannot be truly out-of-sample data. But the key idea is that for each subperiod, only the in-sample 
part may be used for determining the trading parameters. The procedure is summarized in Fig. 15.9. 

Walk-forwards are not free of problems: the in-sample periods are overlapping, and hence dif- 
ferent periods may be affected by a single event. (Out-of-sample periods never overlap.) And again, 
running many walk-forwards is data-snooping just as any other method. But walk-forwards give 
more reliable results, simply because a single method has been applied several times. And as you 
will see in the examples later in this chapter, such walk-forwards will (and should) always be ac- 
companied by robustness checks. 
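
The splitting scheme of a walk-forward can be sketched in a few lines of base R. The helper walk_forward_splits below is hypothetical (it is not part of PMwR or NMOF). It assumes a rolling in-sample window of fixed length, followed by an out-of-sample window, and it moves forward by the length of the out-of-sample window so that out-of-sample parts never overlap. Observations left over at the end are dropped.

```r
## Sketch: index sets for a walk-forward (hypothetical helper).
walk_forward_splits <- function(n, n_in, n_out) {
    stopifnot(n >= n_in + n_out)
    ## first index of each out-of-sample window
    starts <- seq(from = n_in + 1, to = n - n_out + 1, by = n_out)
    lapply(starts, function(s)
        list(in_sample  = (s - n_in):(s - 1),
             out_sample = s:(s + n_out - 1)))
}

## example: 1000 observations, 500 in-sample, 100 out-of-sample
splits <- walk_forward_splits(n = 1000, n_in = 500, n_out = 100)
length(splits)
range(splits[[1]]$in_sample)
range(splits[[1]]$out_sample)

## concatenated out-of-sample parts: each index appears once
oos <- unlist(lapply(splits, `[[`, "out_sample"))
range(oos)
```

For each element of splits, trading parameters would be chosen on in_sample only and then applied, unchanged, to out_sample. Consecutive in-sample windows share 400 of their 500 observations in this example, which is the overlap mentioned above.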


15.1.3 The good: getting insights (and confidence) in strategies 


The discussion so far may have left the impression that backtesting is a dangerous undertaking. 
That is true; backtesting is difficult. But that does not mean one should despair. 

As we said, there are remedies that may at least mitigate the troubles, such as careful data
preparation; running only few tests; sensitivity checks; replication and meta analyses. In any case,
no matter whether you judge other people’s strategies or your own, stay skeptical and keep
reasonable expectations: when people present strategies with outlandishly good results (high returns,
no drawdowns, etc.), or weird parameter values, be cautious. Always compare a strategy to what
would be possible: historically, equity markets had a reward-to-risk ratio of perhaps 0.3. It is simply
not probable that some equity strategy comes with an expected Sharpe ratio of 6.


FIGURE 15.9 Walk-forward. The total time period is split into subperiods 1 to N, and each subperiod is split into an
in-sample part (light-gray) and an out-of-sample part (dark-gray). The in-sample part may be used for determining trading
parameters. In the end, the out-of-sample parts of all subperiods are concatenated.

With all that said: backtesting is a wonderful tool. It is the only way to make empirical state- 
ments about strategies. Just listen to popular investment advice on TV, or read it in a newspaper: 
these are full of statements such as “a company so cheap must be bought”, or “no company should
have so much debt”, or “buying after the market declined is always a good idea”. Essentially none 
of such statements are empirically verified or even scrutinized. There is no doubt that looking at 
past performance gives no guarantee for future returns. But when a claim is made that some strat- 
egy has worked, and tests show that it has not, quite a bit has been learnt. So do test strategies. As 
we pointed out in Chapter 1, finance does not suffer from too much empirical research, but from 
too little. Backtesting is much preferred over anecdotal evidence. 


Backtests may also help against visual traps and selective perception. A good example is a simple
strategy of moving averages, just like the one we used in the initial experiments. Any book on
technical analysis is going to show you a graphic like the one on the right. But such (carefully
chosen) graphics are often much more promising than the results of a systematic test of such an
indicator, which would be done via a backtest.


Also, backtesting is preferable to only running regressions or computing other summary statis- 
tics as proxies for potential profits. Running a real backtest will make sure all units are right, trade 
logic is followed, and there are no forgotten assumptions. It is much easier to make mistakes when 
we look only at correlations, say, or by proxying potential profits from a strategy by chaining to- 
gether returns. See Blume and Stambaugh (1983) for a well-known example of the latter problem, 
which would not occur with an actual backtest. 


15.2 Data and software 


Let us turn to the implementation of backtesting software. 


15.2.1 What data to use? 


Backtests necessarily work on past data. It is important that the data do not introduce biases, such 
as lookahead.
FIGURE 15.10 Gilli and Schumann (2017) study strategies that invest in German equities over the period from January
2003 to May 2017. Their dataset comprises the components of the so-called HDAX, which bundles the 110 largest stocks 
in Germany. Gilli and Schumann (2017) find that when correcting for survivorship bias, an equally weighted portfolio of 
the components of the HDAX would have performed similarly to the index, as shown in the figure. However, using only the 
most recent components would have led to a survivorship bias of about 7% per year. 


There is a simple principle regarding what information and data can be used: at any point in 
time, the algorithm must only rely on data and information that actually was available. Likewise, 
it must only invest in securities or assets that could have been traded, and only at prices at which 
these assets were tradeable.⁶

This principle may be obvious enough, but it is not necessarily easy to follow. Start with what 
data were available. Many databases store the actual timestamps of data, but not when the data 
became available. Think of fundamental data: suppose the earnings of a company are dated De- 
cember 31 of a given year; yet they were certainly not available on that date. The company may 
have reported the numbers on March 12 the following year, say. If you did not follow the company
directly but used a data provider instead, it may still have taken several more days before the data
were entered into the provider’s database. To make matters worse, such reports may have been 
restated later, and the provider may only have stored the most recent (restated) version. 
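
One way to respect the principle with fundamental data is to store, next to the period that a figure refers to, the date on which the figure became available, and to filter on the latter. The following sketch uses a made-up data frame; the function known_on, the column names period_end and available, the dates, and the values are all hypothetical.

```r
## Sketch: point-in-time lookup of fundamental data (base R only).
earnings <- data.frame(
    period_end = as.Date(c("2015-12-31", "2016-12-31", "2017-12-31")),
    available  = as.Date(c("2016-03-15", "2017-03-14", "2018-03-12")),
    eps        = c(1.10, 1.25, 0.90))

## latest figure that was actually known on a given date
known_on <- function(data, when) {
    ok <- data[data$available <= as.Date(when), , drop = FALSE]
    if (nrow(ok) == 0L)
        return(NA_real_)
    ok$eps[which.max(as.numeric(ok$available))]
}

known_on(earnings, "2018-01-15")  # figure for 2017 is not yet known
known_on(earnings, "2018-03-12")  # figure for 2017 has been reported
```

Filtering on period_end instead would hand the algorithm the 2017 figure in January 2018, more than two months before it was reported in this example. Restatements could be handled in the same way, by storing each version of a figure with its own availability date.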

A less obvious problem is the question of what assets were available. Suppose we want to test a
strategy in U.S. stocks, and we define our universe to be that of the S&P 500. We cannot simply 
fetch data on the current components of the index and test our strategy: over time, companies enter 
and exit the index, but these changes are not random. Badly performing companies (which includes 
those that go bankrupt) are excluded, and highflyers enter. See also Bessembinder (2018), who re- 
ports that for the United States the median life-time performance of stocks is worse than that of 
treasury bills. Only looking at the current components of an index thus would introduce a survivor- 
ship bias into our analysis. This bias may not only affect aggregate returns; it may also overstate the 
performance of particular strategies, such as buying-the-dips or momentum. This particular bias has 
been studied in Daniel et al. (2009), and they found that the bias can be huge, with up to 8% higher 
in-sample performance of strategies that suffered from the bias. Gilli and Schumann (2017) ran 
tests for the German equity market over the period 2003-17 and found a bias of similar magnitude; 
see Fig. 15.10. 
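
The mechanics of survivorship bias can be reproduced with a stylized simulation. The sketch below is not the procedure of any of the cited studies. It assumes 1,000 hypothetical stocks with i.i.d. Gaussian annual log returns (mean 5%, volatility 30%) over ten years, and it treats the 500 stocks with the best total performance as the “current components”, which is much cruder than actual index rules.

```r
## Sketch: survivorship bias from selecting on past performance.
set.seed(7)
n_stocks <- 1000
n_years <- 10

## hypothetical annual log returns, one column per stock
r <- matrix(rnorm(n_stocks * n_years, mean = 0.05, sd = 0.30),
            nrow = n_years, ncol = n_stocks)
total <- colSums(r)

## "current components": the 500 stocks with the best total performance
survivors <- order(total, decreasing = TRUE)[1:500]

## average annual log return: full universe vs survivors only
all_stocks <- mean(total) / n_years
surv_only <- mean(total[survivors]) / n_years
c(all = all_stocks, survivors = surv_only,
  bias = surv_only - all_stocks)
```

Even though every stock has the same expected return by construction, the survivors show a clearly higher average return, because membership was decided with knowledge of the outcome.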

A second problem related to the availability of assets, and perhaps a more obvious one, is 
volume. A strategy that invests in small caps, for instance, must make allowance for illiquid stocks. 
Short sales may not have been possible or may have been very expensive. Also, reported prices were 
not necessarily tradeable: bid—ask spreads, for instance, may or may not be reflected in reported 
prices. But either case may need treatment. If only the midprice is reported and used, we understate 
transaction costs. On the other hand, if reported prices are simply last prices, they may reflect bid 
or ask prices, and hence an algorithm may be inclined to learn to buy at the bid and sell at the
ask. Settlement prices may also require attention: depending on the exchange, these may only be
computed, i.e. do not constitute tradeable prices (e.g., for German Bund futures, the settlement
price reported by the Eurex is actually an average price). If in doubt, one may have to check the
exchange procedures.


6. An example that often comes up is futures prices. Futures have a limited lifespan, often only a few months. The
cleanest method is, per the stated principle, to rebalance into the active futures contracts, i.e. to roll over a position. To
simplify computations, continuous series are often used, which is fine for many strategies, provided that they are chained
together properly (e.g. by using returns).
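
The chaining of futures contracts by returns, mentioned in the footnote on futures prices, can be sketched as follows. The two price vectors are invented. The sketch assumes that we hold the front contract up to the close of the roll day and the next contract afterwards, and it splices returns rather than price levels, so that the price gap between the contracts does not show up as a spurious return.

```r
## Sketch: continuous futures series chained by returns.
## hypothetical closing prices of two contracts on the same six days
front <- c(100, 101, 102, 101, 103, 104)
back  <- c(105, 106, 107, 106, 108, 109)
roll  <- 4   # we switch contracts at the close of day 4

ret_front <- diff(front) / front[-length(front)]
ret_back  <- diff(back)  / back[-length(back)]

## return of day t (t = 2, ..., 6): front contract up to the roll day,
## back contract after it
days <- 2:length(front)
ret <- ifelse(days <= roll, ret_front, ret_back)

## continuous series, started at the first front price
continuous <- front[1] * cumprod(c(1, ret))
continuous
```

Simply switching from the front price of 101 on day 4 to the back price of 108 on day 5 would suggest a gain of about 7%, whereas neither contract moved by more than 2% on that day.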

In sum, running backtests requires careful data preparation, which typically takes much time. 


15.2.2 Designing backtesting software 


Writing software for backtesting depends on what you want to test. There are some obvious re- 
quirements: the software should take care of repeated tasks, such as the accounting of trades, and 
it should allow reporting. Beyond that, much is left to the preferences or requirements of the user 
and the particular strategies, and there are different tradeoffs. 


• Flexibility versus convenience: software may rely on building blocks, such as strategy components,
which make writing composite strategies faster, though sometimes at the price of being
less flexible.

• Reusability of code: ideally, we could use the same code in testing and in production. Depending
on the overall setup, this may come at the price that quickly testing a strategy becomes more
difficult.

• Simplicity: are we simply testing broad ideas, or do we need the fine details of trades, down to
modeling the order book? Again, there is a trade-off, as more details typically mean that it takes
more time to implement a strategy.

• Speed: do we want to use the code as input to an optimization routine? If yes, we may have to
trade off convenience for speed.


In any case, we stress the idea of backtesting software, not just algorithms. Backtesting software 
is a great example for code reuse: using the same implementation for different strategies saves time, 
and errors that are found can be fixed everywhere at once. 


15.2.3 The btest function 


In this section, we give a high-level introduction to one possible implementation of backtesting
software. This implementation can be found in the function btest in package PMwR, which we
already used. In terms of the trade-offs described in the previous section, btest is strongly biased 
towards flexibility and simplicity. 

The logic of btest can be summarized via two questions that a trader needs to answer at a given
instant in time (in actual life, “now”):


1. Do I want to compute a new target portfolio, yes or no? If yes, go ahead and compute the new 
target portfolio. 

2. Given the target portfolio and the actual portfolio, do I want to rebalance (i.e. close the gap 
between the actual portfolio and the target portfolio)? If yes, rebalance. 


If such a decision is not just hypothetical, then the answer to the second question may lead to 
a number of orders sent to a broker. Note that many traders do not think in terms of stock (i.e. 
balances) as we did here; rather, they think in terms of flow (i.e. orders). Both approaches are 
equivalent, but the described one makes it easier to handle missed trades and synchronize accounts. 
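
The relation between the two views can be written down in one line: the orders are the difference between target and actual positions. The sketch below uses invented positions for three hypothetical assets A, B, and C, and it also shows why the stock view copes well with missed trades.

```r
## Sketch: from balances (stock) to orders (flow).
## hypothetical positions, in units of each asset
actual <- c(A = 100, B = 50, C = 0)
target <- c(A = 80, B = 50, C = 25)

## the orders that close the gap: sell 20 of A, buy 25 of C
orders <- target - actual
orders[orders != 0]

## suppose only the order in A is executed and the one in C is missed
actual_next <- actual + c(A = -20, B = 0, C = 0)

## recomputing the gap recovers the missed trade automatically
target - actual_next
```

With a pure flow view, the missed order in C would have to be remembered and resent explicitly. With the stock view, comparing the target with the actual account is enough.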

Implementing btest required a number of decisions too: (i) what to model (i.e. how to simulate
the trader), and (ii) how to code it. As an example for point (i): how precisely do we want to
model the order process (e.g. use limit orders? allow partial fills?). An example for (ii): the backbone
of btest is a loop that runs through the data. Loops are slow in R when compared with compiled
languages, so should we vectorize instead? Vectorization is indeed often possible, namely if trading
is not path-dependent. If we already have a list of trades, we can efficiently transform them into a
profit-and-loss in R without relying on an explicit loop. Yet, one advantage of looping is that the 
trade logic is more similar to actual trading; we may even be able to reuse some code in live trading. 
Thus, btest relies on a loop internally. To implement the questions above, it relies on a functional 
approach. 
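
As an illustration of the vectorized alternative, here is a minimal sketch that turns a given list of trades into a profit-and-loss series without an explicit loop. The prices and trades are made up, and the sketch assumes that each trade is executed at the price of its period and that there are no transaction costs.

```r
## Sketch: profit-and-loss from a list of trades, without a loop.
prices <- c(100, 102, 101, 105, 104)
## hypothetical trades: units bought (+) or sold (-) at each price
trades <- c(1, 0, 1, -2, 0)

position <- cumsum(trades)     # units held after each trade
## profit of period t: position at the end of t-1 times price change
pl <- c(0, position[-length(position)] * diff(prices))
cumsum(pl)

## check against cash accounting: cash plus value of open position
cash <- -cumsum(trades * prices)
wealth <- cash + position * prices
all.equal(cumsum(pl), wealth)
```

The computation works because the trades are known in advance. If the size of a trade depended on the wealth or the position reached so far, i.e. if trading were path-dependent, the position could not be obtained from a single cumsum.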
In fact, the aim for btest is to stick to the functional paradigm as much as possible. Functions
receive arguments and evaluate to results; but they do not change their arguments, nor do they 
assign or change other variables “outside” their environment, nor do the results depend on some 
variable outside the function. You will see examples in the following section. Using functions in 
this way creates a problem, namely how to keep track of state. If we know what variables need to 
be persistent, we could pass them to the function and always have them returned. But we would 
like to be more flexible, so we can pass an environment, as described below. To make that clear: 
functional programming should not be seen as a yes-or-no decision; it is a matter of degree, and
adopting even part of the functional approach can already help.
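
The two ways of keeping track of state can be contrasted in a generic sketch. It does not use the interface of btest; the functions count_pure and count_env are hypothetical and merely count how often they have been called.

```r
## Sketch: two ways to keep state between function calls.

## pure variant: the state is passed in and returned;
## nothing outside the function changes
count_pure <- function(state) {
    state$n <- state$n + 1
    state
}
s <- list(n = 0)
s <- count_pure(s)
s <- count_pure(s)
s$n

## environment variant: the function receives an environment and
## modifies it in place, so any variable can be kept between calls
count_env <- function(env) {
    env$n <- env$n + 1
    invisible(NULL)
}
e <- new.env()
e$n <- 0
count_env(e)
count_env(e)
e$n
```

The pure variant requires that we know in advance which variables make up the state. The environment variant gives up some purity, since the function now changes its argument, but new variables can be added to the environment at any time without changing the function's signature.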


15.3 Simple backtests 


(Simple means univariate.) 


15.3.1 btest: a tutorial 


When it comes to computation, much of backtesting is accounting: keeping track of what is bought 
and sold at what time, and valuing open positions. This accounting we shall leave to the function 
btest, which we describe in this section. The function is provided in the R package PMwR, which 
we load and attach first. 


> library("PMwR")


A useless first example 


We really like simple examples. Suppose we have a single instrument, and we use only its closing 
prices. Let us make up some prices.⁷


> prices <- 101:110 
> prices 


[1] 101 102 103 104 105 106 107 108 109 110 


The trading rule is to buy one unit of the asset and hold it forever. (As we said, it is to be a simple
example.) Here is how we could test this strategy with btest. 


> bt.results <- btest(prices, function() 1) 
> bt.results 


initial wealth 0 => final wealth 8 


Great: we increased our wealth from zero to 8. (Which is a lucky number in China.) But perhaps 
we had better explain what the code did. Let us start with the input. 

The function first takes the prices as an argument. The prices should be ordered in time, and 
they will be used for determining entry and exit prices, and also for valuing open positions. btest 
actually expects open, high, low and close prices, but it can work with close prices alone as well. 
See the appendix on page 479. 

As a second argument, we pass a function, which simply returns 1 whenever it is called. Note 
that we wrote it as an anonymous function, because it was so short. But we could as well have 
defined a named function: 


> fun <- function() 
1 
> bt.results <- btest(prices, fun) 


7. Such prices are genuinely useful when you debug code. 

This function that we pass to btest is called the signal function. It must be written so that 
it returns the target, or suggested, position, in units of the asset. The position is only suggested 
because we may, for some reason, prefer not to trade, so this suggested position never becomes an 
actual one. In this first example, the suggested position always is 1 unit. Under the default settings, 
the signal function is called at every instance in time. Since we did not explicitly define time, 
btest loops over the prices and calls signal for every price. 

To make the description more precise, let T be the number of price observations, i.e. here the 
length of prices. Then btest runs a loop from b + 1 to T. The variable b is the burn-in, 
and it needs to be a nonnegative integer. When we take decisions that are based on past data, we will 
lose at least one data point, notably when our signals are based on returns, which is why the default 
burn-in is 1. In rare cases (such as our example) b may be zero. More often, it will be larger; for 
instance, when a strategy is based on a moving average. 

Here is an important default: at time t, we can use information up to time t-1. Suppose that t 
were 4, i.e. the loop is at the 4th price observation. We may use all information up to time 3, and 
trade at the open in period 4: 


t    time      open  high  low   close
1    HH:MM:SS                            <--\
2    HH:MM:SS                            <-- - use information
3    HH:MM:SS                            <--/
4    HH:MM:SS   X                        <-- trade here
5    HH:MM:SS


We could also trade at the close (notably when we have no open price, as in our example): 


t    time      open  high  low   close
1    HH:MM:SS                            <--\
2    HH:MM:SS                            <-- - use information
3    HH:MM:SS                            <--/
4    HH:MM:SS                      X     <-- trade here
5    HH:MM:SS


No, we cannot trade at the high or low. Some people like the idea, as a robustness check, 
to always buy at the high, and to sell at the low. Robustness checks, i.e. forcing bad luck into the 
simulation (notably bad executions), are a good idea. High-low ranges can inform such checks, but 
using these ranges does not go far enough, and is more of a good story than a meaningful test. 

With the inputs and the general workings described, let us show a little more output. The result 
of calling btest is a list of several components, notably wealth (the equity curve) and position 
(a matrix of the positions that the strategy has held). These components are by default not 
printed. Instead, a simple print method displays some information about the results: 


initial wealth 0 => final wealth 8 


In this case, it tells us that the total equity of the strategy increased from 0 to 8. The function 
trade_details extracts data from bt.results and prints them as a table. 


> trade_details <- function(bt.results, prices) 
      data.frame(price    = prices, 
                 suggest  = bt.results$suggested.position, 
                 position = unname(bt.results$position), 
                 wealth   = bt.results$wealth, 
                 cash     = bt.results$cash) 


> trade_details(bt.results, prices) 


   price suggest position wealth cash 
1    101       0        0      0    0 
2    102       1        1      0 -102 
3    103       1        1      1 -102 
4    104       1        1      2 -102 
5    105       1        1      3 -102 
6    106       1        1      4 -102 
7    107       1        1      5 -102 
8    108       1        1      6 -102 
9    109       1        1      7 -102 
10   110       1        1      8 -102 


The initial cash is zero by default, so initial wealth is also zero in this case. (We may change 
it through the argument initial.cash.) As the table shows, we bought one unit in the second 
period at a price of 102. This position was then held until the final period, when it was valued at 
110, leading to a final wealth of 8. 
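
The accounting behind this table can be retraced with a few lines of base R. The following is only a toy sketch under simplifying assumptions (one asset, trades at the price of the current period, no costs); it is not the code that btest runs, and the names toy_backtest and buy_and_hold are made up for the illustration.

```r
## Toy accounting loop; a sketch, not the code that btest runs.
## 'signal' receives the prices observed up to t - 1 and returns
## the target position in units; trades are executed at prices[t].
toy_backtest <- function(prices, signal, b = 1, initial.cash = 0) {
    n.obs <- length(prices)
    suggest <- position <- cash <- numeric(n.obs)
    pos.prev  <- 0
    cash.prev <- initial.cash
    for (t in seq_len(n.obs)) {
        if (t > b) {
            suggest[t] <- signal(prices[seq_len(t - 1)])
            trade <- suggest[t] - pos.prev
        } else {
            trade <- 0    # burn-in: no signal, no trade
        }
        position[t] <- pos.prev + trade
        cash[t] <- cash.prev - trade * prices[t]
        pos.prev  <- position[t]
        cash.prev <- cash[t]
    }
    data.frame(price = prices, suggest, position,
               wealth = cash + position * prices, cash)
}

buy_and_hold <- function(past.prices) 1
res1 <- toy_backtest(101:110, buy_and_hold)          # burn-in of 1
res0 <- toy_backtest(101:110, buy_and_hold, b = 0)   # no burn-in
res1$wealth[10]  # 8
res0$wealth[10]  # 9
```

With a burn-in of one, the unit is bought at 102 and the final wealth is 8; without burn-in, it is bought at 101 and the final wealth is 9.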

That we only bought at the second price is a result of the burn-in argument b, which defaults to 
one, and so we lose one observation. In the particular case here, we do not rely on the past, and so 
we may set b to zero. We now buy at the first price and hold until the end of time (i.e. the end of 
data). 


> bt.results0 <- btest(prices = prices, 
                       signal = function() 1, 
                       b = 0) 

> trade_details(bt.results0, prices) 


   price suggest position wealth cash 
1    101       1        1      0 -101 
2    102       1        1      1 -101 
3    103       1        1      2 -101 
4    104       1        1      3 -101 
5    105       1        1      4 -101 
6    106       1        1      5 -101 
7    107       1        1      6 -101 
8    108       1        1      7 -101 
9    109       1        1      8 -101 
10   110       1        1      9 -101 


Note that now we explicitly named the arguments in the function call. btest has more than 20 
arguments, though almost all are optional. So naming arguments is the preferred way to call the 
function. 

As we said, btest returns a list of several components, one of which is a journal of the trades. 
It may be extracted with the function of the same name, journal. 


> journal(bt.results) 


   instrument timestamp amount price 
1     asset 1         2      1   102 

1 transaction 


> journal(bt.results0) 


   instrument timestamp amount price 
1     asset 1         1      1   101 

1 transaction 


We will see later how to make the journal more informative by passing timestamp and instrument 
information. Similarly, you may extract the position by calling the function position. 


> position(bt.results) 


   [,1] 
1     0 
2     1 
3     1 
4     1 
5     1 
6     1 
7     1 
8     1 
9     1 
10    1 


Before we move on, a remark on data frequency. We have not made any specific assumption 
about data frequency: that is because the code makes none, so any frequency (intraday, daily, 
monthly, ...) is fine. btest does not care what frequency your data have or whether they are 
regularly spaced; it will only loop over the observations that it is given. It is the user’s responsibility 
to write signal (and other functions) in such a way that they encode a meaningful trade logic. 

What other functions? When btest runs its loop along the prices, at every iteration, it goes 
back to the two questions we stated above: 1) Do I want to compute a target portfolio? If yes, 
do it. 2) Do I want to close the gap between the actual portfolio and the target portfolio? If yes, 
rebalance. The answers to these questions come from the functions signal, do.signal, and 
do.rebalance. 


Function arguments 


We have already seen signal: it uses information until and including t-1 and returns the suggested 
portfolio (a vector) to be held at t. This position should be in units of the instruments. If 
you prefer a signal function that returns weights, set convert.weights to TRUE. Then, the 
value returned by signal will be interpreted as weights and will be automatically converted to 
position sizes. 

The function do.signal is supposed to answer the first part of the first question: should we 
compute a signal at all? do.signal uses information until and including t-1 and must return 
TRUE or FALSE to indicate whether a signal (i.e. new suggested position) should be computed. 
This is useful when the signal computation is costly and should only be done at specific points in 
time. If the function is not specified, it defaults to function() TRUE. Instead of a function, 
do.signal may also be 


a vector of integers, which then indicate the points in time when to compute a position, or 

a vector of logical values, which then indicate the points in time when to compute a position, or 
a vector that inherits from the class of timestamp (e.g. Date), or 

a keyword such as firstofmonth or lastofmonth (in this case, timestamp must inherit 
from Date or be coercible to Date). 
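
To make the keyword variants concrete, here is a small base-R sketch of how a logical vector for "first of month" and "last of month" could be derived from a vector of dates. We assume here that these terms refer to the first and last available observation in each calendar month; the code is an illustration and not taken from btest.

```r
dates <- as.Date(c("2020-01-30", "2020-01-31", "2020-02-03",
                   "2020-02-04", "2020-02-28", "2020-03-02"))

## Year-month label of every observation.
month <- format(dates, "%Y-%m")

## An observation is the first (last) of its month if its label
## has not occurred before (does not occur again later).
first.of.month <- !duplicated(month)
last.of.month  <- !duplicated(month, fromLast = TRUE)

data.frame(dates, first.of.month, last.of.month)
```

Such a logical vector is of the same kind as the "vector of logical values" listed above.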


The function do.rebalance is just like do.signal, but refers to the actual trading. If the 
function is not specified, it defaults to function() TRUE. Note that rebalancing can typically 
not take place at a higher frequency than implied by signal. That is because calling signal 
leads to a position, and when this position does not change (i.e. signal was not called), there is 
actually no need to rebalance. So do.rebalance is normally used when rebalancing should be 
done less often than signal computation, e.g. when the decision whether to trade or not is 
conditional on something. 

For completeness’s sake: btest takes two more functions as inputs. One is print.info, 
which is called at the end of an iteration. Whatever it returns will be ignored since it is called for 
its side effect: print information to the screen, into a file or into some other connection. The other 
function is cashflow. It is also called at the end of each iteration; its value is added to the cash 
position. The function provides a clean way to add accrued interest to or subtract fees from the cash 
account. 

The default is to specify no arguments to these functions. But we see next that they may, 
nevertheless, access useful information. 


More-useful examples 


Let us make the strategy more selective. The trading rule is to have a position of 1 unit of the asset 
whenever the last observed price is no higher than 105, and to have no position otherwise. 


> signal <- function() { 
      if (Close() <= 105) 
          1 
      else 
          0 
  } 


> trade_details(btest(prices, signal), prices) 


   price suggest position wealth cash 
1    101       0        0      0    0 
2    102       1        1      0 -102 
3    103       1        1      1 -102 
4    104       1        1      2 -102 
5    105       1        1      3 -102 
6    106       1        1      4 -102 
7    107       0        0      5    5 
8    108       0        0      5    5 
9    109       0        0      5    5 
10   110       0        0      5    5 


If you like to write clever code, you may as well have written signal this way: 

> signal <- function() 
      Close() <= 105 
> trade_details(btest(prices, signal), prices) 


   price suggest position wealth cash 
1    101       0        0      0    0 
2    102       1        1      0 -102 
3    103       1        1      1 -102 
4    104       1        1      2 -102 
5    105       1        1      3 -102 
6    106       1        1      4 -102 
7    107       0        0      5    5 
8    108       0        0      5    5 
9    109       0        0      5    5 
10   110       0        0      5    5 


The logical value of the comparison Close() <= 105 would be converted to either 0 or 1. 
But the more verbose version above is clearer.⁸ In the example, we (apparently) use the close 
price, but we do not access the data directly. Instead, we call a function Close. This function is 
defined by btest and passed as an argument to signal. Note that we do not add Close as 
a formal argument to signal; it is done automatically. In fact, adding it would trigger an error 
message: 


8. To cite Brian Kernighan: “Everyone knows that debugging is twice as hard as writing a program in the first place. So if 
you’re as clever as you can be when you write it, how will you ever debug it?” 

> btest(prices, function(Close = NA) 1) 


Error in btest(prices, function(Close = NA) 1) 
‘Close’ cannot be used as an argument name for ‘signal’ 


There is not only Close that can be accessed within signal. We also have these objects: 

Open                open prices 
High                high prices 
Low                 low prices 
Close               close prices 
Wealth              the total wealth (value of positions plus cash) at a given point in time 
Cash                cash (in accounting currency) 
Time                current time (an integer) 
Timestamp           the timestamp when that is specified (i.e. when the argument timestamp 
                    is supplied); if not, it defaults to Time 
Portfolio           the current portfolio 
SuggestedPortfolio  the currently suggested portfolio 
Globals             an environment 


All these objects, with the exception of Globals, are functions. Globals is an environment, 
which can be used for storing data persistently. The listed functions may be called from within 
signal, and also from within the other functional arguments to btest (do.signal, 
do.rebalance, and so on). Because of that we call them inner functions. Inner functions can only read; 
there are no replacement functions, so you cannot change the close prices, for instance. 
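
One way to picture such read-only accessors is as closures over the data and over the loop counter. The following base-R sketch is a toy under that assumption, not the way btest is implemented; make_accessor, clock and toy_close are hypothetical names. The accessor takes a lag relative to the current period, with a default of 1.

```r
## Build a read-only accessor: it closes over the data and over an
## environment 'clock' that holds the current loop index t.
make_accessor <- function(data, clock) {
    function(lag = 1)
        data[clock$t - lag]
}

clock <- new.env()
toy_close <- make_accessor(101:110, clock)

clock$t <- 4
toy_close()      # 103, i.e. the value at t - 1
toy_close(3:1)   # 101 102 103
clock$t <- 5
toy_close()      # 104
```

Since the caller only ever receives copies of the values, there is nothing it could overwrite; and because the accessor looks backwards from the current period, a default lag of 1 never reveals the current observation.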

All inner functions take as their first argument a lag, which defaults to 1. So to get the most 
recent close price, say 


> Close() 


which is the same as Close(lag = 1). 
The lag can be a vector, too: the expression 


> Close(Time():1) 


for instance will return all available close prices. So in period 11, say, you want close prices for lags 
10, 9, ..., 1. Hence, to receive prices in their correct order, the lag sequence must be decreasing. If 
you find it awkward to specify the lag in reverse order, you can use the argument n instead, which 
specifies to retrieve the last n data points. So the above Close(Time():1) is equivalent to 


> Close(n = Time()) 

and saying 

> Close(n = 10) 


will get you the last ten closing prices. The function Time is particularly useful when it comes 
to accessing other data. Suppose our strategy were based not only on the last close price, but on a 
moving average. Specifically, let us buy when the last close is above the average of the previous 
k prices. Let us fix k at 3, which means we also need to set b to 3. 


> signal <- function() { 
      k <- 3 
      ma <- sum(Close(n = k))/k 
      if (Close() > ma) 
          1 
      else 
          0 
  } 
> trade_details(btest(prices = prices, 
                      signal = signal, 
                      b = 3), 
                prices) 


   price suggest position wealth cash 
1    101       0       NA     NA    0 
2    102       0       NA     NA    0 
3    103       0        0      0    0 
4    104       1        1      0 -104 
5    105       1        1      1 -104 
6    106       1        1      2 -104 
7    107       1        1      3 -104 
8    108       1        1      4 -104 
9    109       1        1      5 -104 
10   110       1        1      6 -104 


This implementation is straightforward, but it may be improved in two ways. First, the value of k. 
Hard-coding a variable in the code is bad practice. (Just as it was bad practice above to hard-code the 
price level of 105 within signal.) Instead we should add the parameter as an argument to signal, 
whose value we pass via the ... argument of btest. Thus, we need to name the argument in the 
function call. 


> signal <- function(k) { 
      ma <- sum(Close(n = k))/k 
      if (Close() > ma) 
          1 
      else 
          0 
  } 
> trade_details(btest(prices, signal, b = 3, k = 3), prices) 


   price suggest position wealth cash 
1    101       0       NA     NA    0 
2    102       0       NA     NA    0 
3    103       0        0      0    0 
4    104       1        1      0 -104 
5    105       1        1      1 -104 
6    106       1        1      2 -104 
7    107       1        1      3 -104 
8    108       1        1      4 -104 
9    109       1        1      5 -104 
10   110       1        1      6 -104 

> trade_details(btest(prices, signal, b = 5, k = 5), prices) 


   price suggest position wealth cash 
1    101       0       NA     NA    0 
2    102       0       NA     NA    0 
3    103       0       NA     NA    0 
4    104       0       NA     NA    0 
5    105       0        0      0    0 
6    106       1        1      0 -106 
7    107       1        1      1 -106 
8    108       1        1      2 -106 
9    109       1        1      3 -106 
10   110       1        1      4 -106 


The second improvement is in favor of speed. Since we know all prices, we may precompute the 
moving average and pass it as well. Note how we use Time() to access the current value of ma. 


> ma <- MA(prices, 3, pad = NA) 
> signal <- function(ma) { 
      if (Close() > ma[Time()]) 
          1 
      else 
          0 
  } 
> trade_details(btest(prices, signal, b = 3, ma = ma), prices) 


   price suggest position wealth cash 
1    101       0       NA     NA    0 
2    102       0       NA     NA    0 
3    103       0        0      0    0 
4    104       1        1      0 -104 
5    105       1        1      1 -104 
6    106       1        1      2 -104 
7    107       1        1      3 -104 
8    108       1        1      4 -104 
9    109       1        1      5 -104 
10   110       1        1      6 -104 
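
For readers who want to see what is being precomputed, a trailing moving average with NA padding can be sketched in base R. We assume here that the average at position t covers the observations t - order + 1 to t; the helper trailing_ma is a hypothetical stand-in written for this illustration, not the code of MA.

```r
## Trailing moving average of the last 'order' values; positions
## with fewer than 'order' observations available are set to NA.
trailing_ma <- function(y, order) {
    n <- length(y)
    cs <- cumsum(c(0, y))        # cs[i + 1] is the sum of y[1:i]
    ma <- rep(NA_real_, n)
    if (n >= order) {
        idx <- order:n
        ma[idx] <- (cs[idx + 1] - cs[idx + 1 - order]) / order
    }
    ma
}

ma3 <- trailing_ma(101:110, 3)
ma3   # NA NA 102 103 104 105 106 107 108 109
```

In period 4, for instance, the most recent close is 103 and the third element of the average is 102, so the rule suggests a position of 1, in line with the table above.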


We now know enough to revisit MA_crossover. Here is the function again. 

> MA_crossover 
function(m, S) { 
    m.fast <- MA(S, m[1], pad = NA) 
    m.slow <- MA(S, m[2], pad = NA) 

    crossover <- function() { 
        if (m.fast[Time()] > m.slow[Time()]) 
            1 
        else 
            0 
    } 
    tail(btest(S, signal = crossover, 
               b = 60, initial.cash = 100, 
               convert.weights = TRUE)$wealth, 
         n = 1) 
} 


The function first computes the two moving averages; within the signal function, the current 
values of these moving averages are then accessed with Time(). The backtest is run and its equity 
curve (wealth) is extracted, of which the final value (tail(..., 1)) is returned. 


More examples 
If we want to trade a different size, we have signal return the desired value. 


> trade_details(btest(prices, signal = function() 5), 
                prices) 


   price suggest position wealth cash 
1    101       0        0      0    0 
2    102       5        5      0 -510 
3    103       5        5      5 -510 
4    104       5        5     10 -510 
5    105       5        5     15 -510 
6    106       5        5     20 -510 
7    107       5        5     25 -510 
8    108       5        5     30 -510 
9    109       5        5     35 -510 
10   110       5        5     40 -510 


An often-used way to specify a trading strategy is to map past prices into +1, 0 or -1 for going 
long, flat or short. A signal is then often only given at one specified point (as in “buy one unit 
now”). Example: suppose our rule said “buy after the third period”. To access the current time, we 
have already seen that the function Time can be used. 


> signal <- function() 
      if (Time() == 3L) 
          1 else 0 
> trade_details(btest(prices, signal), prices) 


   price suggest position wealth cash 
1    101       0        0      0    0 
2    102       0        0      0    0 
3    103       0        0      0    0 
4    104       1        1      0 -104 
5    105       0        0      1    1 
6    106       0        0      1    1 
7    107       0        0      1    1 
8    108       0        0      1    1 
9    109       0        0      1    1 
10   110       0        0      1    1 


But this is not what we wanted. If the rule is to buy and then keep the long position, we should 
have written it like this. 


> signal <- function() 
      if (Time() == 3L) 
          1 else Portfolio() 
> trade_details(btest(prices, signal), prices) 


   price suggest position wealth cash 
1    101       0        0      0    0 
2    102       0        0      0    0 
3    103       0        0      0    0 
4    104       1        1      0 -104 
5    105       1        1      1 -104 
6    106       1        1      2 -104 
7    107       1        1      3 -104 
8    108       1        1      4 -104 
9    109       1        1      5 -104 
10   110       1        1      6 -104 


The function Portfolio evaluates to the previous period’s portfolio. Like Close, its first argument 
sets the time lag, which defaults to 1. 

We may also prefer to specify signal so that it evaluates to a weight; for instance, after a 
portfolio optimization. 


> signal <- function() 
      0.5    ## invest 50% of wealth in asset 


In such a case, you need to set convert.weights to TRUE. We also need to have a meaningful 
initial wealth: 50 percent of nothing is nothing. 


> bt <- btest(prices, 
              signal, 
              convert.weights = TRUE, 
              initial.cash = 100) 
> trade_details(bt, prices) 


   price suggest position wealth  cash 
1    101   0.000    0.000    100 100.0 
2    102   0.495    0.495    100  49.5 
3    103   0.490    0.490    100  50.0 
4    104   0.488    0.488    101  50.2 
5    105   0.486    0.486    101  50.5 
6    106   0.483    0.483    102  50.7 
7    107   0.481    0.481    102  51.0 
8    108   0.479    0.479    103  51.2 
9    109   0.476    0.476    103  51.5 
10   110   0.474    0.474    104  51.7 


Now something interesting has happened: the function traded in every period from the second. That 
is because we never invested exactly 50%, which in turn happened because we never knew the price 
at which we would execute the trade. (Which is the case in practice.) In a backtest we could “cheat,” 
of course, by looking ahead; and btest would allow you to do that (though it will never happen 
by default). 


But a much better way is to check the actual trade sizes: 


> journal(bt) 


   instrument timestamp   amount price 
1     asset 1         2  0.49505   102 
2     asset 1         3 -0.00485   103 
3     asset 1         4 -0.00236   104 
4     asset 1         5 -0.00233   105 
5     asset 1         6 -0.00230   106 
6     asset 1         7 -0.00227   107 
7     asset 1         8 -0.00224   108 
8     asset 1         9 -0.00221   109 
9     asset 1        10 -0.00218   110 

9 transactions 
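
The trade sizes in the journal can be retraced by hand. The sketch below uses base R only and rests on our reading of the mechanics described above: the target position in period t is the weight times the wealth of period t - 1, divided by the price of period t - 1, and the trade is then executed at the price of period t. It is an illustration of the arithmetic, not the code of btest.

```r
prices <- 101:110
w    <- 0.5     # target weight
cash <- 100     # initial cash
pos  <- 0       # units held
trades <- numeric(0)

for (t in 2:length(prices)) {
    ## wealth and price as known at the end of period t - 1
    wealth.prev <- cash + pos * prices[t - 1]
    target <- w * wealth.prev / prices[t - 1]
    trade  <- target - pos
    ## execution at the price of period t
    cash <- cash - trade * prices[t]
    pos  <- target
    trades <- c(trades, trade)
}
round(trades, 5)
```

Because the execution price is higher than the price used for sizing the position, the realised weight is slightly above 50%, and a small sale follows in every subsequent period.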


The trades will become very small. So small, in fact, that in practice you would likely not 
execute them because of transaction costs. To handle such cases, we may rewrite signal so 
that it checks whether it should trade or not. Alternatively, for many simple (and common) cases, 
as here, btest has an argument do.rebalance, which also is a function. It should evaluate to TRUE 
if we want to trade, and FALSE otherwise. 


> dont_if_small <- function() { 
      diff <- SuggestedPortfolio(0) - Portfolio() 
      abs(diff) > 5e-2 
  } 
> bt <- btest(prices, 
              signal, 
              convert.weights = TRUE, 
              initial.cash = 100, 
              do.rebalance = dont_if_small) 
> trade_details(bt, prices) 


price suggest position wealth cash 


1 101 0.000 0.000 100 100.0 
2 102 0.495 0.495 100 49.5 
3 103 0.490 0.495 100 49.5 
4 104 0.488 0.495 101 49.5 
5 105 0.486 0.495 101 49.5 
6 106 0.483 0.495 102 49.5 
7 107 0.481 0.495 102 49.5 
8 108 0.479 0.495 103 49.5 
9 109 0.477 0.495 103 49.5 
10 110 0.475 0.495 104 49.5 


do.rebalance is called after signal. Hence, the suggested position is known and the lag 
may be zero (SuggestedPortfolio(0)). 

The tol argument to btest works similarly: it instructs btest to only rebalance when at 
least one absolute suggested change in any single position is greater than tol. The default is 0.00001, 
which practically means always rebalance. 

The argument tol.p has a slightly different effect: it restricts rebalancing to those trades that 
lead to position changes of more than 100*tol.p percent. 


> bt <- btest(prices, 
              signal, 
              convert.weights = TRUE, 
              initial.cash = 100, 
              tol = 5e-2) 
> trade_details(bt, prices) 


price suggest position wealth cash 
1 101 0.000 0.000 100 100.0 
2 102 0.495 0.495 100 49.5 
3 103 0.490 0.495 100 49.5 
4 104 0.488 0.495 101 49.5 
5 105 0.486 0.495 101 49.5 
6 106 0.483 0.495 102 49.5 
7 107 0.481 0.495 102 49.5 
8 108 0.479 0.495 103 49.5 
9 109 0.477 0.495 103 49.5 
10 110 0.475 0.495 104 49.5 
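
The effect of such an absolute threshold can again be retraced by hand. The following base-R sketch makes the same assumptions about sizing and execution as the description above (target computed from the wealth and price of period t - 1, execution at the price of period t) and trades only when the suggested change exceeds 0.05 units. It is a toy calculation, not the code of btest.

```r
prices <- 101:110
w <- 0.5            # target weight
threshold <- 5e-2   # minimum absolute change in units before we trade
cash <- 100
pos  <- 0
n.trades <- 0

for (t in 2:length(prices)) {
    wealth.prev <- cash + pos * prices[t - 1]
    target <- w * wealth.prev / prices[t - 1]
    if (abs(target - pos) > threshold) {
        cash <- cash - (target - pos) * prices[t]
        pos  <- target
        n.trades <- n.trades + 1
    }
}
n.trades        # 1
round(pos, 3)   # 0.495
round(cash, 1)  # 49.5
```

Only the initial purchase passes the threshold; afterwards the gap between suggested and actual position stays at about 0.02 units or less, so the position remains at 0.495, as in the table.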


That basically concludes the tutorial for btest. Now let us use a real dataset. 


15.3.2 Robert Shiller’s Irrational-Exuberance data 


Going to a backtest with real data creates a dilemma. On the one hand, we think that computation 
and its purpose should not be separated (recall the Analyst from Chapter 1), and working with 
real data is much more illuminating and interesting. 

On the other hand, this book is about tools for financial computation, and the purpose of this 
chapter is to describe software and techniques; the purpose is not to do empirical analysis. But 
in the process of backtesting, most of the time must be scheduled first for checking, cleaning and 
preparing data, and then for analyzing results: crucial tasks, but simply not the topic of this chapter 
(nor of this book). Still, we decided to include examples based on real datasets. Just keep in mind 
that the discussions are not meant as complete case studies whose results could be applied as is. 

Robert Shiller has collected aggregate data for U.S. stocks for the period 1871 until today, 
which he used in his research papers and, notably, in his book “Irrational Exuberance” (Shiller, 
2000, 2015). The dataset, which he regularly updates, is available from his homepage at 
http://www.econ.yale.edu/~shiller/. The NMOF package comes with a convenience function Shiller 
that downloads the data and arranges them in a data frame.⁹


> library("NMOF") 
> data <- Shiller(dest.dir = "~/Downloads/Shiller") 
> str(data) 


'data.frame':   1777 obs. of  10 variables: 
 $ Date         : Date, format: "1871-01-31" 
 $ Price        : num  4.44 4.5 4.61 4.74 4.86 4.82 4.. 
 $ Dividend     : num  0.26 0.26 0.26 0.26 0.26 0.26 
 $ Earnings     : num  0.4 0.4 0.4 0.4 0.4 0.4 0.4 0... 
 $ CPI          : num  12.5 12.8 13 12.6 12.3 
 $ Long Rate    : num  5.32 5.32 5.33 5.33 5.33 
 $ Real Price   : num  89.6 88.1 88.9 94.9 99.5 
 $ Real Dividend: num  5.24 5.09 5.01 5.2 5.33 
 $ Real Earnings: num  8.07 7.83 7.71 8.01 8.19 
 $ CAPE         : num  NA NA NA NA NA NA NA NA NA NA 


We will also use packages zoo (Zeileis and Grothendieck, 2005), for handling time-series, and 
PMwR. 


> library("PMwR") 
> library("zoo") 


Shiller’s dataset comprises, among other variables, a monthly total-return series of the S&P 
Composite Index¹⁰ and a time-series of aggregate earnings for the stocks included in the index. 
Taking the ratio of these two series gives us a total-market price/earnings, or P/E, ratio, i.e. a measure 
of the valuation of the stock market as a whole. Shiller averages earnings over a period of ten years 
(see Shiller, 1996), for which he took inspiration from Graham and Dodd (1934). The result is 
usually referred to as the Cyclically-Adjusted P/E, or CAPE, ratio. We extract these two series and 
store them as zoo objects price and CAPE. We also store the dates as a vector timestamp. 


> timestamp <- data$Date 

> price <- scale1(zoo(data$Price, timestamp), 
                  level = 100) 

> CAPE <- zoo(data$CAPE, timestamp) 


Figs. 15.11 and 15.12 provide plots of both series. The code for both graphics is essentially the 
following. 


> plot(price, 
       xlab = "", ylab = "S&P Composite", 
       xaxt = "n", yaxt = "n", log = "y", 
       type = "n", lwd = 0) 
> abline(h = axTicks(2), 
         lwd = 0.25, col = grey(.7)) 
> axis(2, lwd = 0) 
> t <- axis.Date(1, x = data$Date, lwd = 0) 
> abline(v = t, lwd = 0.25, col = grey(.7)) 
> lines(price, type = "l") 
> abline(v = as.Date(c("1929-09-30", 
                       "1999-12-31")), 
         col = grey(0.5)) 


9. A note for Windows users: the symbol ~ in the function call stands for the user’s home directory, which is essentially 
equivalent to C:\Users\<username>\, though this has varied between Windows versions. To use it in R on Windows, 
the safest way is to set an environment variable HOME to the desired path. In R, you may query the value of ~ with 
path.expand("~"). 

10. Today, this means the S&P 500. The S&P 500 index dates back to the 1920s, though it was computed from fewer 
(less than one hundred) companies back then. Accordingly, it was not called S&P 500, but S&P Composite. Only in 1957 
did the number of stocks in the index reach 500. See https://www.britannica.com/topic/SandP-500 . 


FIGURE 15.11 Time-series from Robert Shiller’s dataset: S&P stock market index since January 1871. The series starts at 
100. The vertical axis uses a log scale. The two vertical lines indicate high values of the CAPE ratio in the past (Fig. 15.12): 
September 1929 and December 1999. Both times were close to market tops. 

FIGURE 15.12 Time-series from Robert Shiller’s dataset: Cyclically-adjusted price/earnings ratio (CAPE ratio). The two 
vertical lines indicate high values in the past: September 1929 and December 1999. Both times were close to market tops. 


Fig. 15.12 shows that the aggregate valuation of stocks, when measured as a multiple of realized 
earnings, fluctuates substantially over time: from below 10 to more than 40 at its peak in the 
late 1990s. This fluctuation primarily stems from changes in the price series; the earnings series, 
because it is smoothed, varies much less. This leads to the idea that the CAPE ratio may give 
indications as to when the market is overvalued. By this theory, because earnings are rather stable, 
a high ratio means that prices have overshot their fundamental value, and vice versa. And indeed: 
after peaks in the CAPE, the markets typically performed poorly: see 1929 and 2000, which we 
have marked in Figs. 15.11 and 15.12. 

But, as we argued above, such graphical reasoning can easily be deceiving. We could run a 
straightforward backtest: whenever the CAPE is below 25, we stay fully invested in the index; when 
above 25, we sell and stay out of the market. But we cannot do that, of course. We could never 
have known that 25 would be a high level for the CAPE until much later. So we write signal 
differently: we move out of the market when the CAPE exceeds the 90th quantile of its known 
history. 


> avoid_high_valuation <- function(CAPE) { 
      Q <- quantile(CAPE[n = Time()], 0.9, na.rm = TRUE) 
      if (CAPE[Time()] > Q) 
          0 
      else 
          1 
  } 
> bt_avoid_high_valuation <- 
      btest(price, 
            signal = avoid_high_valuation, 
            initial.cash = 100, 
            convert.weights = TRUE, 
            CAPE = CAPE, 
            timestamp = timestamp, 
            b = as.Date("1899-12-31")) 


Remarks: first, in the signal function, named avoid_high_valuation, we compute the 
quantile on the fly. Alternatively, we might have precomputed it for the whole series, as we did in 
the previous section for the moving average. Second, we specify the burn-in b as a date, which is 
often more convenient than specifying a number. This is only possible because (third) we specify 
another argument, timestamp, when we call btest. Passing the timestamp has a number of 
benefits. For instance, if you want to see how the strategy fared, it is easiest to extract the equity 
curve with as.NAVseries, which now displays the timestamps. 


> as.NAVseries(bt_avoid_high_valuation) 


31 Dec 1899 ==> 31 Jan 2019   (1430 data points, 0 NAs) 
        100         37653.3 
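
The first remark above mentions that the quantile of the known history could also be precomputed. A minimal base-R sketch of such an expanding-window quantile follows; the function expanding_quantile and the numbers are made up for the illustration, and the element at position t uses only observations 1 to t, so no later data enter.

```r
## Quantile of the history known up to and including position t,
## for every t; NA as long as no non-missing value has been seen.
expanding_quantile <- function(x, p) {
    vapply(seq_along(x), function(t) {
        past <- x[seq_len(t)]
        if (all(is.na(past)))
            NA_real_
        else
            unname(quantile(past, p, na.rm = TRUE))
    }, numeric(1))
}

x <- c(NA, 10, 12, 30, 11, 9)     # a made-up valuation series
q90 <- expanding_quantile(x, 0.9)
q90                               # NA 10.0 11.8 26.4 24.6 22.8

## 0 (out of the market) when the value exceeds its historical
## 90% quantile, 1 otherwise
ifelse(x > q90, 0, 1)             # NA 1 0 0 1 1
```

In a backtest, the comparison made with the data of period t would determine the position for period t + 1, in line with the default lag of 1.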

For NAVseries, there exists an informative summary method (we suppress some of its output). 


> summary(as.NAVseries(bt_avoid_high_valuation)) 


31 Dec 1899 ==> 31 Jan 2019   (1430 data points, 0 NAs) 
        100         37653.3 

High                  37653.29  (31 Jul 2016) 
Low                      95.10  (30 Sep 1900) 
Return (%)                 5.1 
Max. drawdown (%)         79.1 
_ peak                  487.77  (30 Apr 1930) 
_ trough                101.89  (30 Jun 1932) 
_ recovery                      (31 Aug 1951) 
_ underwater now (%)       0.0 
Volatility (%)            13.9 
_ upside                  10.6 
_ downside                 9.2 

When timestamp information is available, the actual dates are shown when we extract the trades 
with journal. 


> journal(bt_avoid_high_valuation) 


Similarly, we could have passed a character vector instrument, giving the actual asset names. 
If you check the journal, you see that the strategy traded quite often to adjust the actual position. 
If you rerun the backtest with a restriction on trade size (e.g. by using arguments tol or tol.p), 
you will see that these trades did not substantially affect the performance. 

Now let us compare the resulting equity curve with the underlying stock market, i.e. a buy- 
and-hold investment. Because we will do this for other strategies later on, we define two small 
tools (read: functions). The first function, merge_series, takes any number of zoo, btest or 
NAVseries objects, and synchronizes them, i.e. it joins them so that their timestamps match. The 
result is a multivariate zoo series. 


> merge_series <- function(..., series.names) { 
      s <- list(...) 
      if (missing(series.names)) { 
          if (!is.null(ns <- names(s))) 
              series.names <- ns 
          else 
              series.names <- seq_along(s) 
      } 
      to_zoo <- function(x) { 
          if (inherits(x, "btest")) { 
              as.zoo(as.NAVseries(x)) 
          } else if (inherits(x, "NAVseries")) { 
              as.zoo(x) 
          } else if (inherits(x, "zoo")) { 
              x 
          } else 
              stop("only zoo, btest and NAVseries are supported") 
      } 
      s <- lapply(s, to_zoo) 
      series <- do.call(merge, s) 
      if (is.null(dim(series))) 
          series <- as.matrix(series) 

      series <- scale1(series, level = 100) 
      colnames(series) <- series.names 
      series 
  } 


The second function, series_ratio, computes the ratio of two series. This is useful to see when 
one strategy performed better than another (Schumann, 2013b). 


> series_ratio <- function(t1, t2) { 
      if (missing(t2)) 
          scale1(t1[, 1]/t1[, 2], level = 100) 
      else 
          scale1(t1/t2, level = 100) 
  } 


So let us use these tools. We first merge the backtest with the S&P data. 


> series <- merge_series( 
      "btest" = bt_avoid_high_valuation, 
      "S&P" = price) 


From these series, we create two plots, shown in Figs. 15.13 and 15.14. Fig. 15.14 shows the 
ratio of the strategy’s equity curve and the benchmark. When the numerator (i.e. the strategy) 
performs better than the denominator (i.e. the benchmark), the line rises; when both series grow 
at the same rate, the line is flat; and when the benchmark performs better than the strategy, the 
line declines. For a long time the line is flat, which happens when the strategy is invested in the 
market. In the 1929 crash the strategy actually did well, though not at once: it first underperformed
by about 20%, as the strategy went out of the market and the market kept on rising. When stock
prices plummeted, however, the strategy was vindicated and left with an outperformance of more
than 30%. That is, an investor in the strategy would have had a 30% higher terminal wealth than
an investor in the benchmark. In the 1990s, however, the strategy often was not invested, yet the
market failed to crash. Hence, the line declines.

FIGURE 15.13 Results of avoiding high valuation. The S&P index is shown in black; the strategy in gray.

FIGURE 15.14 Results of avoiding high valuation compared with S&P 500.
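How such a ratio line behaves can be checked with a small, made-up example. The following sketch uses base R only; the helper ratio_to_100 is hypothetical (it is not part of PMwR), and the two equity curves are invented so that the three cases (flat, rising, falling) occur one after another.

```r
## Sketch: relative performance of two equity curves (base R only).
## 'ratio_to_100' is a hypothetical helper, not part of PMwR.
ratio_to_100 <- function(strategy, benchmark) {
    r <- strategy / benchmark
    100 * r / r[1]
}

## benchmark grows by 1% in every period; the strategy grows at the
## same rate at first, then faster, then not at all
benchmark <- 100 * cumprod(c(1, rep(1.01, 6)))
strategy  <- 100 * cumprod(c(1, 1.01, 1.01, 1.03, 1.03, 1.00, 1.00))

rel <- ratio_to_100(strategy, benchmark)
round(rel, 2)

## direction of the ratio line in each period:
## 0 = flat, 1 = rising, -1 = falling
sign(round(diff(rel), 8))
```

The line is flat while both series grow at the same rate, rises while the strategy grows faster, and declines while the benchmark grows faster, whatever the absolute performance of either series.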

These results are somewhat sobering. The original figures (15.11 and 15.12) had indicated that 
high valuations were a sign of market tops, but our strategy created from the valuation measure 
failed to profit substantially, despite substantial declines in the market in the years that followed the 
tops. The strategy got out of the market, but then went back in too quickly, notably in the 1930s; 
and it clearly missed out on large parts of the bull markets since the 1990s. 

Let us see how our choice of the 90th quantile affects the results. Before you object: the goal is 
not to find some optimal level for the quantile, but to get an impression of the influence the choice 
has. To do this, we first make the quantile an argument to the signal function. 


> avoid_high_valuation_q <- function(CAPE, q) {
      Q <- quantile(CAPE[n = Time()], q, na.rm = TRUE)
      if (CAPE[Time()] > Q)
          0
      else
          1
  }
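To get a feeling for the rule, it may help to evaluate it outside of a backtest. The following is a minimal sketch in base R, under the assumption that the quantile is computed from all observations up to and including the current one; the function name valuation_signal and the data are made up for illustration.

```r
## Sketch: the valuation rule evaluated outside of btest (base R only).
## 'valuation_signal' is a hypothetical name; at each time t it uses
## only the observations 1..t, so there is no look-ahead.
valuation_signal <- function(cape, q) {
    vapply(seq_along(cape), function(t) {
        Q <- quantile(cape[seq_len(t)], q, na.rm = TRUE)
        if (cape[t] > Q) 0 else 1
    }, numeric(1))
}

## made-up valuation ratios with two spikes
cape <- c(10, 12, 11, 30, 13, 12, 40, 14)
valuation_signal(cape, q = 0.9)
```

Note that with an expanding window every new maximum lies above the quantile, so early in the sample the rule leaves the market quite easily. This is one reason why a burn-in period is useful.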
Then we fix a vector of values for q. 


> q.values <- c(0.6, 0.7, 0.8,
                seq(0.90, 0.99, by = 0.01))


The straightforward way would now be to loop over these values, each time calling btest and storing
the results we are interested in. The order in which we run these backtests is not relevant, as the tests
are independent of one another. This suggests that we may distribute the computation. There are
several ways in which we may do this, and we outline a simple one.¹¹ First, for each variation of
the backtest that we want to run, we collect all arguments to the call to btest in a list args.


> args <- vector("list", length(q.values))                     [collect-args]
> names(args) <- as.character(q.values)
> for (q in q.values)
      args[[as.character(q)]] <-
          list(coredata(price),
               signal = avoid_high_valuation_q,
               initial.cash = 100,
               convert.weights = TRUE,
               CAPE = CAPE,
               q = q,
               timestamp = timestamp,
               b = as.Date("1899-12-31"))


It is easy to memorize this pattern of setting up the parallel computation because it resembles calling
btest in a loop: only instead of calling btest, we pack all arguments that we would have used
into a list.
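The pattern itself does not depend on btest. The following sketch illustrates it in base R with a made-up function toy_backtest, which is only a stand-in: it uses a full-sample quantile to stay short and is not meant as a sensible strategy.

```r
## Sketch of the "pack the arguments, then do.call" pattern (base R).
## 'toy_backtest' is a hypothetical stand-in for btest.
toy_backtest <- function(prices, q, initial.cash = 100) {
    n <- length(prices)
    invested <- prices <= quantile(prices, q)
    growth <- ifelse(invested[-n], prices[-1] / prices[-n], 1)
    initial.cash * cumprod(c(1, growth))
}

prices   <- c(100, 105, 98, 110, 120, 90, 95)
q.values <- c(0.6, 0.8, 0.9)

## serial: call the function in a loop
serial <- vector("list", length(q.values))
names(serial) <- as.character(q.values)
for (q in q.values)
    serial[[as.character(q)]] <- toy_backtest(prices, q = q)

## packed: one list of arguments per variation ...
args <- lapply(q.values, function(q) list(prices, q = q))
names(args) <- as.character(q.values)

## ... which can be handed to lapply, or to a cluster function
packed <- lapply(args, function(x) do.call(toy_backtest, x))

all.equal(serial, packed)
```

Because each element of args is complete in itself, lapply may be replaced by a function that sends the elements to different processes.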
> ## serial                                                    [compare-methods]
> variations_q_serial <- vector("list", length(q.values))
> names(variations_q_serial) <- as.character(q.values)
> for (q in q.values)
      variations_q_serial[[as.character(q)]] <-
          btest(coredata(price),
                signal = avoid_high_valuation_q,
                initial.cash = 100,
                convert.weights = TRUE,
                CAPE = CAPE,
                q = q,
                timestamp = timestamp,
                b = as.Date("1899-12-31"))
> ## parallel
> library("parallel")
> ## parallel: socket cluster
> cl <- makePSOCKcluster(rep("localhost", 2))
> ignore <- clusterEvalQ(cl, require("PMwR"))
> variations_q_parallel1 <-
      clusterApplyLB(cl, args,
                     function(x) do.call(btest, x))
> names(variations_q_parallel1) <- as.character(q.values)
> stopCluster(cl)
> ## parallel: fork cluster
> cl <- makeForkCluster(nnodes = 4)
> ignore <- clusterEvalQ(cl, require("PMwR"))
> variations_q_parallel2 <-
      clusterApplyLB(cl, args,
                     function(x) do.call(btest, x))
> names(variations_q_parallel2) <- as.character(q.values)
> stopCluster(cl)


Compare the results. 


> all.equal(variations_q_serial, variations_q_parallel1)

[1] TRUE

> all.equal(variations_q_serial, variations_q_parallel2)

[1] TRUE


11. A downside of this approach is that we may make copies of all the data, and move these data to the node at every call.
An appendix to this chapter outlines alternative strategies.


FIGURE 15.15 Performance of strategy for different values of q. The S&P index is shown in black; the strategy variations
in gray.



Such computations are useful so often that btest already implements this distribution of backtests.
So instead of the setup we described above, we could have used the following code (which
uses a fork cluster; on Windows, say method = "parallel" instead).


> variations_q <-
      btest(coredata(price),
            signal = avoid_high_valuation_q,
            initial.cash = 100,
            convert.weights = TRUE,
            CAPE = CAPE,
            timestamp = timestamp,
            b = as.Date("1899-12-31"),
            variations = list(q = q.values),
            variations.settings =
                list(method = "multicore",
                     cores = 4,
                     label = as.character(q.values)))
> all.equal(variations_q, variations_q_serial)

[1] TRUE


Whether we use a loop or not, we end up with a list of backtests. We extract the equity curves of 
these backtests and plot them. The results are shown in Fig. 15.15. 


> series_var <- do.call(merge_series, variations_q) 


(Such a one-line piece of code should make clear why it was useful to prepare tools such as 
merge_series.) 

In Fig. 15.15 all gray lines are above the black market line, which shows that for all values 
of q, we would have done somewhat better (at least not worse) than the market in the 1929 crash. 
However, this outperformance is eroded and in some cases lost in the 1990s. If we look at the whole 
series, we find that leaving the market when valuations are truly extreme led to good results. In this 
way the strategy avoided parts of the two crashes, while at the same time it stayed long enough in 
the market before the markets dropped. We can see this most easily by comparing final wealth, i.e. 
total returns. 


> returns(window(series[, "S&P"],                              [returns-total]
                 start = as.Date("1901-01-31"),
                 end = as.Date("2017-12-31")),
          period = "itd") ## inception to date


37585.1% [31 Jan 1901 -- 31 Dec 2017]


> returns(window(series_var,
                 start = as.Date("1901-01-31"),
                 end = as.Date("2017-12-31")),
          period = "itd")

0.6: 11021.1% 31 Jan 1901 -- 31 Dec 2017 
0.7: 10343.3% 31 Jan 1901 -- 31 Dec 2017 
0.8: 13635.0% 31 Jan 1901 -- 31 Dec 2017 
0.9: 32379.7% 31 Jan 1901 -- 31 Dec 2017 
0.91: 32120.6% 31 Jan 1901 -- 31 Dec 2017 
0.92: 29116.0% 31 Jan 1901 -- 31 Dec 2017 
0.93: 29741.7% 31 Jan 1901 -- 31 Dec 2017 
0.94: 34895.9% 31 Jan 1901 -- 31 Dec 2017 
0.95: 45656.4% 31 Jan 1901 -- 31 Dec 2017 
0.96: 49066.2% 31 Jan 1901 -- 31 Dec 2017 
0.97: 41583.1% 31 Jan 1901 -- 31 Dec 2017 
0.98: 40555.6% 31 Jan 1901 -- 31 Dec 2017 
0.99: 39991.6% 31 Jan 1901 -- 31 Dec 2017 


Waiting for an extreme valuation level was necessary in the 1990s: stay in the market for as long as
you could. But this behavior was less advantageous in the 1930s.


> returns(window(series[, "S&P"],
                 start = as.Date("1920-12-31"),
                 end = as.Date("1950-12-31")),
          period = "itd")


190.0% [31 Dec 1920 -- 31 Dec 1950]


> returns(window(series_var,
                 start = as.Date("1920-12-31"),
                 end = as.Date("1950-12-31")),
          period = "itd")

0.6: 380.1% 31 Dec 1920 -- 31 Dec 1950 
0.7: 355.9% 31 Dec 1920 -- 31 Dec 1950 
0.8: 231.9% 31 Dec 1920 -- 31 Dec 1950 
0.9: 277.8% 31 Dec 1920 -- 31 Dec 1950 
0.91: 238.8% 31 Dec 1920 -- 31 Dec 1950 
0.92: 238.8% 31 Dec 1920 -- 31 Dec 1950 
0.93: 238.8% 31 Dec 1920 -- 31 Dec 1950 
0.94: 249.4% 31 Dec 1920 -- 31 Dec 1950 
0.95: 308.1% 31 Dec 1920 -- 31 Dec 1950 
0.96: 336.7% 31 Dec 1920 -- 31 Dec 1950 
0.97: 290.5% 31 Dec 1920 -- 31 Dec 1950 
0.98: 190.0% 31 Dec 1920 -- 31 Dec 1950 
0.99: 190.0% 31 Dec 1920 -- 31 Dec 1950 




You see that now the finding is somewhat reversed: waiting until the last moment was not the 
most beneficial strategy in the 1920s. 

We will leave it at that. The moral: first, it is not easy to move from a potentially valuable
indicator, such as stock-market valuation, to a profitable investment strategy. Second, as we
recommended above, one should try to understand why a strategy works. In the case of Robert Shiller’s
data, the strategy gained its advantage from moving out of the market at critical times; these results
are not refutable. But the performance gain comes from only two episodes that occurred over a
period of 150 years, which is not a lot of observations. As we said above, analyzing such results
and drawing conclusions from them is not easy: leaving the market too early can, in relative terms, be
just as costly as staying in when the market crashes. Alan Greenspan made his famous “irrational
exuberance” remarks in 1996 (after a testimony by Robert Shiller and John Campbell).

When we look at the S&P (with dividends included) between January 1996 and 2018,
the index never saw its 1996 level again, not even in 2000-02 or 2007-08.

Finally, we should stress that the example does not imply that the CAPE ratio “does not work” 
as an indicator of when to invest (which is something that cannot be shown in any case). But the 
example demonstrates that a picture may be much more promising than the actual implementation. 


15.4 Backtesting portfolio strategies 


(That is, we look at multi-asset backtests.) 


15.4.1 Kenneth French’s data library 


Kenneth French collects, updates, and makes available a treasure trove of datasets of U.S. American 
equity markets on his website at http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/. Many of 
these data are derived from the CRSP database (http://www.crsp.com/). In this section, we work 
with a dataset comprising industry returns, as described in Fama and French (1997). 

We first load the data. The NMOF package provides a convenience function French for down- 
loading and processing the data. As before, we will make much use of the zoo package for handling 
time-series. 


> library("NMOF")
> library("PMwR")
> library("zoo")
> P <- French("~/Downloads/French",
              dataset = "49_Industry_Portfolios_daily_CSV.zip",
              weighting = "value",
              frequency = "daily",
              price.series = TRUE,
              na.rm = TRUE)


We will use only a subset of the series, starting in 1990. We also drop the 49th industry (“Other”). 
The subset was only chosen to make the analysis more tractable; you may easily adjust the code to 
work with the full history (or some other dataset). The return series are transformed to a price series 
and stored as a zoo series P. We store the timestamp separately. Table 15.1 presents summary 
statistics of the single series. 


> START <- as.Date("1990-01-01")
> END <- as.Date("2018-07-31")
> ## make zoo series
> P <- zoo(P, as.Date(row.names(P)))
> P <- window(P, start = START, end = END)
> timestamp <- index(P)
> P <- P[, colnames(P) != "Other"]
> short <- colnames(P)



TABLE 15.1 Statistics of the industry series. Returns are annualized, in percent. Volatility is
computed from monthly returns and also annualized.


Short name   Return   Volatility     Short name   Return   Volatility
             in %     in %                        in %     in %

Softw        14.6     25.3           BusSv        10.4     17.0
Guns         14.1     20.5           Food         10.4     13.8
Aero         14.1     20.4           BldMt        10.3     21.3
Ships        13.6     25.5           Agric        10.5     ?
MedEq        13.1     16.6           Hshld         9.9     14.4
Smoke        13.1     23.0           Boxes         9.7     20.5
ElcEq        12.7     21.4           Oil           9.6     18.6
Fin          12.4     23.5           Cnstr         9.4     23.7
Beer         12.4     16.4           Util          9.3     13.6
Rtail        12.3     17.0           Paper         9.3     ?
Chips        12.0     ?              Mines         9.0     26.9
Fun          12.3     26.4           Whlsl         9.1     15.7
Drugs        12.1     15.9           Hlth          8.5     ?
LabEq        ?        ?              PerSv         7.6     20.0
Meals        11.8     16.4           Telcm         7.4     17.4
Soda         ?        24.0           Txtls         6.6     ?
Insur        ?        18.0           Books         6.6     18.9
Rubbr        11.3     19.8           Autos         6.2     25.9
Chems        10.9     19.1           Toys          6.3     ?
Trans        10.9     17.9           FabPr         5.7     26.1
Hardw        10.6     ?              Steel         5.6     28.3
Clths        10.8     ?              Coal          3.8     42.1
Banks        10.6     ?              RlEst         3.0     ?
Mach         10.3     22.6           Gold         -0.7     37.9

> instrument <- colnames(P)
> defs <- French("~/Downloads/French", "Siccodes49.zip")
> data.frame(Abbr = colnames(P),
             Description = defs[match(colnames(P), defs$abbr), "industry"])


    Abbr   Description
1   Agric  Agriculture
2   Food   Food Products
3   Soda   Candy & Soda
4   Beer   Beer & Liquor
5   Smoke  Tobacco Products
6   Toys   Recreation
7   Fun    Entertainment
8   Books  Printing and Publishing
9   Hshld  Consumer Goods
10  Clths  Apparel
11  Hlth   Healthcare
12  MedEq  Medical Equipment
13  Drugs  Pharmaceutical Products
14  Chems  Chemicals
15  Rubbr  Rubber and Plastic Products
16  Txtls  Textiles
17  BldMt  Construction Materials
18  Cnstr  Construction
19  Steel  Steel Works Etc
20  FabPr  Fabricated Products
21  Mach   Machinery
22  ElcEq  Electrical Equipment
23  Autos  Automobiles and Trucks
24  Aero   Aircraft
25  Ships  Shipbuilding, Railroad Equipment
26  Guns   Defense
27  Gold   Precious Metals
28  Mines  Non-Metallic and Industrial Metal Mining
29  Coal   Coal
30  Oil    Petroleum and Natural Gas
31  Util   Utilities
32  Telcm  Communication
33  PerSv  Personal Services
34  BusSv  Business Services
35  Hardw  Computers
36  Softw  Computer Software
37  Chips  Electronic Equipment
38  LabEq  Measuring and Control Equipment
39  Paper  Business Supplies
40  Boxes  Shipping Containers
41  Trans  Transportation
42  Whlsl  Wholesale
43  Rtail  Retail
44  Meals  Restaurants, Hotels, Motels
45  Banks  Banking
46  Insur  Insurance
47  RlEst  Real Estate
48  Fin    Trading


We start by looking at the data (a.k.a. plotting). The results are shown in Fig. 15.16. 


> plot(scale1(P, level = 100),                                 [industries-series]
       plot.type = "single",
       log = "y",
       col = grey.colors(ncol(P)),
       xlab = "",
       ylab = "")


There clearly was an uptrend in prices; and there also was substantial variation in the cross-section 
of industry returns. But apart from that, it is difficult to read the chart, as the different lines can 
hardly be discerned from one another. A useful alternative way to plot such a collection of series is 
a fanplot. The following code block shows how it may be constructed. 


> P100 <- scale1(P, level = 100) ## P must be 'zoo'            [fanplot]
> P100 <- aggregate(P100,
                    datetimeutils::end_of_month(index(P100)),
                    tail, 1)
> nt <- nrow(P100)
> levels <- seq(0.01, 0.49, length.out = 10)
> greys <- seq(0.9, 0.50, length.out = length(levels))

> ### start with an empty plot
> plot(index(P100), rep(100, nt), ylim = range(P100),
       xlab = "", ylab = "",
       lty = 0,
       type = "l",
       log = "y")

> ### ... and add polygons
> for (level in levels) {
      l <- apply(P100, 1, quantile, level)
      u <- apply(P100, 1, quantile, 1 - level)
      col <- grey(greys[level == levels])
      polygon(c(index(P100), rev(index(P100))),
              c(l, rev(u)),
              col = col,
              border = NA)
  }


FIGURE 15.16 Performance of industry portfolios as provided by Kenneth French’s data library, January 1990 to May
2018. Time-series are computed from daily returns.

FIGURE 15.17 Fanplot of the industry time-series shown in Fig. 15.16.


The results, shown in Fig. 15.17, look much clearer. And as you will see later, when we add 
other benchmarks, the implied trend is very reasonable. Of course, such code should not be run as 
a script, so we move the code into a function fan_plot, which we will reuse later. 
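What such a function might look like is sketched below. This is a minimal version in base R that works on a plain matrix (rows are points in time, columns are series); the argument names are assumptions, and the function actually used for the figures may differ in its details.

```r
## Sketch of a reusable fan plot (base R only). The argument names are
## assumptions; the function used in the text may differ.
fan_plot <- function(x, t = seq_len(nrow(x)),
                     levels = seq(0.01, 0.49, length.out = 10),
                     ...) {
    greys <- seq(0.9, 0.5, length.out = length(levels))

    ## empty plot that fixes the axes
    plot(t, x[, 1], type = "n", ylim = range(x), log = "y",
         xlab = "", ylab = "", ...)

    ## one polygon per pair of quantiles, from outer to inner
    for (i in seq_along(levels)) {
        l <- apply(x, 1, quantile, levels[i])
        u <- apply(x, 1, quantile, 1 - levels[i])
        polygon(c(t, rev(t)), c(l, rev(u)),
                col = grey(greys[i]), border = NA)
    }
    invisible(NULL)
}

## usage with simulated price paths
set.seed(1)
R <- matrix(rnorm(120 * 20, mean = 0.005, sd = 0.05), nrow = 120)
X <- 100 * apply(1 + R, 2, cumprod)
pdf(NULL)   ## draw to a null device; remove this line to see the plot
fan_plot(X)
dev.off()
```

Since the function only needs a matrix and a vector of timestamps, it can be applied to industry series and to collections of backtests alike.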

Besides the total returns of the different industries, we should also be interested in how they 
move together. For the sake of not making an already long chapter even longer, we limit ourselves 
to computing pairwise correlations. But analyzing co-movement between the assets is an important 
question, and there are different ways to answer it. In fact, it would be best to receive and compare 
several answers, e.g. from factor analysis, or PCA. 
> C <- cor(returns(P, period = "month"))                       [correlations]
> cors <- C[lower.tri(C)]
> summary(cors)


 Min. 1st Qu. Median  Mean 3rd Qu.  Max.
0.012   0.379  0.503 0.488   0.616 0.854


A histogram of these correlations is shown in Fig. 15.18.

FIGURE 15.18 Correlations of monthly returns.

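For readers who want to try other ways of looking at co-movement, the following sketch shows the mechanics with simulated data and base R only. Returns are drawn from a made-up one-factor model; all numbers are illustrative assumptions and have nothing to do with the industry data.

```r
## Sketch: pairwise correlations and a first look at PCA (base R).
## The returns are simulated from a one-factor model; all numbers
## are made up for illustration.
set.seed(42)
n.obs <- 300
n.assets <- 10
common <- rnorm(n.obs, sd = 0.04)
R <- sapply(seq_len(n.assets),
            function(i) common + rnorm(n.obs, sd = 0.04))

C <- cor(R)
cors <- C[lower.tri(C)]   ## each pair once, without the diagonal
length(cors)              ## n.assets * (n.assets - 1) / 2 pairs
summary(cors)

## share of total variance explained by the first principal component
pca <- prcomp(R, scale. = TRUE)
pca$sdev[1]^2 / sum(pca$sdev^2)
```

A large share for the first component indicates that a single common factor drives much of the variation, which is what we would expect when most pairwise correlations are clearly positive.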
Now that we have seen the data, let us create benchmarks. First, also from Kenneth French’s data 
library, we take a total-market index. We store it as a zoo series named market. 


> ### ... market                                               [market]
> ff3 <- French("~/Downloads/French",
                "F-F_Research_Data_Factors_daily_CSV.zip",
                frequency = "daily",
                price.series = TRUE,
                na.rm = TRUE)
> market <- zoo(ff3[["Mkt-RF"]]*ff3[["RF"]],
                as.Date(row.names(ff3)))
> market <- scale1(window(market, start = START, end = END),
                   level = 100)


A second useful benchmark is an equally weighted portfolio. (We shall from now on refer to this
strategy as EW.) Computing it will be our first multivariate use of btest.


An equally weighted portfolio 


In the previous section we saw that we can access prices in the signal function with Close. For 
more than one asset, btest works just the same, only Close() will now give us prices for all 
assets. More specifically, it will return one row of the prices matrix. So, for an equally weighted 
portfolio, the signal function is straightforward. 


> ew <- function() {                                           [ew]
      n <- ncol(Close())
      rep(1/n, n)
  }


Note that we could have saved a little code (and computing time) by making the number of assets 
n a constant to be passed, or by precomputing w. But running the backtest takes less than a second, 
so this seems unnecessary. 


> bt.ew <- btest(prices = list(coredata(P)),                   [bt-ew]
                 signal = ew,
                 do.signal = "lastofmonth",
                 convert.weights = TRUE,
                 initial.cash = 100,
                 b = 250,
                 timestamp = timestamp,
                 instrument = instrument)


Note the value of do.signal: btest will extract the last trading day of each month in
timestamp and compute and rebalance on those dates.


Fig. 15.19 shows the market series and also the performance of the equally weighted portfolio.
See also Fig. 15.20.

FIGURE 15.19 Benchmarks: Performance of market portfolio and of equally weighted portfolio. The gray shades are the
same as in Fig. 15.17 and indicate the range of performance of the different industries.

FIGURE 15.20 Benchmarks: Outperformance of market versus equally weighted portfolio. A rising line indicates the
outperformance of the market portfolio, and vice versa.


15.4.2 Momentum 


The abnormally high “returns of buying winners and selling losers”, a.k.a. momentum returns, 
were described in Jegadeesh and Titman (1993) and have proved to be perhaps the most widely 
replicated anomaly. Moskowitz and Grinblatt (1999) report evidence that cross-sectional industry 
momentum could explain much of the previously described single-stock momentum. In this section, 
we are going to test such a strategy on the industry portfolios obtained from Kenneth French’s Data 
Library. The purpose of this section is to demonstrate the functionality of the btest function in a 
multi-asset backtest, and also show how backtests may be analyzed and evaluated. 


The setup 


Suppose we wish to invest in the 10 industries that have had the highest returns over the past year. 
We want to give equal weight to each industry. We have already collected the total return series in a 
matrix P. The momentum computation for a given point in time is conveniently put into a function
mom:


> mom <- function(P, k) {                                      [mom]
      o <- order(P[nrow(P), ]/P[1, ], decreasing = TRUE)
      w <- numeric(ncol(P))
      w[o[1:k]] <- 1/k
      w
  }


The computation assumes that the first row of P contains the price one year ago; and the last row 
contains the most recent price. The function does not care whether we have data of daily or monthly 
or any other frequency, nor does it care about the number of rows in P. 

We can best demonstrate how mom works through a small example. Suppose we have just three
assets, with returns 10%, 20%, and 30%. We wish to be equally weighted in the best two assets, i.e.
k is 2.¹²


> P3 <- matrix(c(1  , 1  , 1  ,                                [mom-test]
                 1.1, 1.2, 1.3),
               nrow = 2, byrow = TRUE)
> mom(P3, k = 2)


[1] 0.0 0.5 0.5

Suppose we had wanted to invest only in the asset with the highest return, i.e. k is 1.


> mom(P3, k = 1)


[1] 0 0 1
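Such examples are easily turned into test cases. The following sketch, in base R, repeats the definition of mom from the text so that it runs on its own; the particular checks are our choice and by no means exhaustive.

```r
## Sketch: a few test cases for 'mom' (base R only). The function is
## repeated from the text so that the block runs on its own.
mom <- function(P, k) {
    o <- order(P[nrow(P), ]/P[1, ], decreasing = TRUE)
    w <- numeric(ncol(P))
    w[o[1:k]] <- 1/k
    w
}

P3 <- matrix(c(1  , 1  , 1  ,
               1.1, 1.2, 1.3),
             nrow = 2, byrow = TRUE)

## the two examples from the text
stopifnot(all.equal(mom(P3, k = 2), c(0, 0.5, 0.5)))
stopifnot(all.equal(mom(P3, k = 1), c(0, 0  , 1  )))

## weights sum to one, and exactly k assets are held
for (k in 1:3) {
    w <- mom(P3, k)
    stopifnot(all.equal(sum(w), 1), sum(w > 0) == k)
}

## only the first and last rows matter: inserting a row in between
## leaves the weights unchanged
P3b <- rbind(P3[1, ], c(5, 5, 5), P3[2, ])
stopifnot(all.equal(mom(P3b, k = 2), mom(P3, k = 2)))
```

If all checks pass, the code runs silently; a failing check stops with an error that names the violated condition.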


Let us test mom with more realistic data: we use the first 250 rows of the dataset. Note that 
in the following code block, it is necessary to say coredata(P). See Appendix 15.B as to 
why. 


> mom.latest <- mom(head(coredata(P), 250), 10)
> table(mom.latest)


mom.latest
  0 0.1
 38  10

> df <- data.frame(sector = instrument,
                   weight = mom.latest)
> df[df$weight > 0, ]


   sector weight
2    Food    0.1
4    Beer    0.1
5   Smoke    0.1
12  MedEq    0.1
13  Drugs    0.1
26   Guns    0.1
31   Util    0.1
35  Hardw    0.1
36  Softw    0.1
40  Boxes    0.1


12. The example illustrates an important advantage of collecting all computations in a function: we can easily write test
cases for the function, an activity that is crucial in practical applications.


FIGURE 15.21 Performance of momentum strategy. The vertical axis uses a log scale.


The function mom itself follows a simple pattern: take as input the data matrix (prices, sorted 
in time) and optionally other parameters; return the target portfolio. Now we need to make btest 
understand mom. Recall that btest takes an argument signal: a function that returns the target 
position. A generic implementation of signal takes a function and its parameters as inputs. 


> signal <- function(fun, ...) {                               [signal-fun]
      P <- Close(n = 250)
      fun(P, ...)
  }


You may wonder, why such a nested structure? The reason is code reuse: mom is a function that 
is in no way related to btest, and might be used separately. signal, in turn, is responsible for 
setting up the data P and then calling fun; but it does not need to know what fun actually does. 
With signal defined, we may run btest. 


> bt.mom <- btest(prices = list(P),                            [bt-mom]
                  signal = signal,
                  do.signal = "lastofmonth",
                  convert.weights = TRUE,
                  initial.cash = 100,
                  k = 10,
                  fun = mom,
                  b = 250,
                  timestamp = timestamp,
                  instrument = instrument)


The equity curve of bt.mom is shown in Fig. 15.21. We can summarize the backtest results
with summary.


> summary(as.NAVseries(bt.mom), na.rm = TRUE)


26 Dec 1990 ==> 31 Jul 2018   (6953 data points, 0 NAs)
        100         6345.63
High                6487.71   (26 Jan 2018)
Low                   93.57   (09 Jan 1997)
Return (%)             16.2   (annualised)
Max. drawdown (%)      52.8
_ peak              2952.01   (23 Jun 2008)
_ trough            1394.19   (09 Mar 2009)
_ recovery                    (17 Jan 2013)
_ underwater now (%)    2.2
Volatility (%)         15.9   (annualised)
_ upside               13.5
_ downside              9.6


FIGURE 15.22 Performance of momentum strategy. The vertical axis uses a log scale. The gray line shows the absolute
performance of the market. The black line shows the relative performance of momentum compared with the market, i.e. a
rising line indicates an outperformance of momentum.


Overall, there is little doubt that the momentum portfolio performed well. But it should also be of
interest when it performed well. We first merge the backtest results with the benchmarks.


> series.mom <- merge_series(                                  [series.mom]
      Momentum = bt.mom,
      "Equal-weight" = bt.ew,
      Market = market)


Now we may use series_ratio to plot the ratio of the equity curves, in order to visualize 
the relative performance. The result is shown in Fig. 15.22. The graphic shows the ratio of the 
strategy’s equity curve and the benchmark. When the numerator (i.e. the strategy) performs better 
than the denominator (i.e. the benchmark), the line rises; when both series grow at the same rate, 
the line is flat; and when the benchmark performs better than the strategy, the line declines. 

Fig. 15.22 illustrates that the strategy performed well in 2000-03, and also in the run-up to the 
financial crisis, during which it underperformed the market. Since then, during the “new normal” 
boom, it has performed roughly in line with the market. 


> plot(series.mom,                                             [momentum]
       plot.type = "single", log = "y",
       col = c(grey(0), col.market, col.ew),
       ylab = paste("Growth of USD 100 since",
                    as.character(attr(series.mom,
                                      "scale1_origin"))))
> abline(v = attr(series.mom, "scale1_origin"))


> plot(series.mom[, "Market"], log = "y",                      [momentum2]
       col = grey(.5), ylab = "Performance", xlab = "")
> ## lines(series.mom[, "Momentum"])
> lines(series_ratio(
      series.mom[, c("Momentum", "Market")]),
      ylab = "Performance Momentum/Market", xlab = "")
> abline(v = attr(series.mom, "scale1_origin"))


Sensitivity 
We emphasized before that any backtest depends on many settings. Variables for momentum are: 


• The data: is the performance driven by only a few assets, or only by specific time periods?

• The size of the portfolio (k), and also its weighting scheme.

• The choice of rebalancing period: do we have to switch to the target portfolio quickly, or does
rebalancing not matter much?

• The lookback period.


We are not going to go through each variable, but limit ourselves to principles and some examples.
Computationally, testing variations is straightforward. Suppose we wished to investigate the
influence of the lookback period. We would rewrite signal to take a parameter hist. We could
then fix interesting values for hist and run a loop.
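The idea can be illustrated without running a backtest. In the following sketch (base R, simulated prices), the helper mom_weights_at is hypothetical: it mimics what a signal function with a parameter hist would do at a single point in time, namely cut out the last hist rows of the price matrix and pass them to mom.

```r
## Sketch: varying the lookback 'hist' (base R only). Prices are
## simulated, and 'mom_weights_at' is a hypothetical helper that
## mimics what a signal function with a parameter 'hist' would do.
mom <- function(P, k) {            ## as defined in the text
    o <- order(P[nrow(P), ]/P[1, ], decreasing = TRUE)
    w <- numeric(ncol(P))
    w[o[1:k]] <- 1/k
    w
}

mom_weights_at <- function(P, t, hist, k)
    mom(P[(t - hist + 1):t, , drop = FALSE], k)

set.seed(7)
R <- matrix(rnorm(500 * 6, mean = 0.0003, sd = 0.01), nrow = 500)
P <- 100 * apply(1 + R, 2, cumprod)

hist.values <- c(20, 60, 120, 250)
W <- t(sapply(hist.values,
              function(h) mom_weights_at(P, t = 500, hist = h, k = 2)))
rownames(W) <- hist.values
W    ## one row of target weights per lookback
```

Within btest, the same effect would be achieved by passing hist on to Close; the loop over hist.values then wraps the call to btest instead of the call to the helper.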

We may also want to check the interaction between different inputs, which would require nested 
loops: one loop for each variable of interest. Since the number of combinations explodes quickly 
in such cases, we can often only (randomly) sample from these combinations. 
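A simple way to draw such a sample is to write down the full grid and then pick rows at random. The sketch below uses base R only; the parameter names and values are illustrative assumptions, not recommendations.

```r
## Sketch: sampling from a grid of settings (base R only).
## The parameter names and values are illustrative assumptions.
grid <- expand.grid(k       = c(5, 10, 15, 20),
                    hist    = c(60, 120, 250, 500),
                    holding = c(5, 20, 60, 120),
                    tc      = c(0, 0.001, 0.0025))
nrow(grid)    ## 4 * 4 * 4 * 3 combinations

## pick 25 combinations at random, without replacement
set.seed(123)
chosen <- grid[sample(nrow(grid), 25), ]
head(chosen)

## each row can be turned into one argument list for a backtest
args <- lapply(seq_len(nrow(chosen)),
               function(i) as.list(chosen[i, ]))
str(args[[1]])
```

Fixing the seed makes the sample reproducible, so that the same combinations can be rerun after the strategy or the data have changed.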

In any case, the result of such an analysis will be a collection of backtests, i.e. a collection
of out-of-sample paths. So for any computation we would have done pathwise, we now have a
distribution; e.g., a distribution of returns or volatilities. The first task then is to explore these
distributions. A robust strategy should not show a breakdown of performance along some paths,
nor should its performance vary greatly across paths. It is also interesting to compare families of
settings. For instance, for momentum, one family may comprise backtests that capture short-term
momentum (i.e. often rebalanced), and the other family long-term momentum.
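The step from paths to distributions takes only a few lines. In the following sketch (base R), the paths are simulated and stand in for the equity curves of many backtests; we assume 250 observations per year, and the two helper functions are defined here for illustration only.

```r
## Sketch: from a collection of paths to distributions (base R only).
## 'paths' is simulated: each column stands for the equity curve of
## one backtest; 250 observations per year are assumed.
set.seed(99)
n.days <- 2500
n.paths <- 50
R <- matrix(rnorm(n.days * n.paths, mean = 0.0004, sd = 0.01),
            nrow = n.days)
paths <- 100 * apply(1 + R, 2, cumprod)

ann_return <- function(x, per.year = 250)
    (x[length(x)] / x[1])^(per.year / (length(x) - 1)) - 1

ann_vol <- function(x, per.year = 250)
    sd(diff(log(x))) * sqrt(per.year)

ret <- apply(paths, 2, ann_return)
vol <- apply(paths, 2, ann_vol)

## pathwise statistics become distributions
quantile(ret, c(0.05, 0.25, 0.5, 0.75, 0.95))
quantile(vol, c(0.05, 0.25, 0.5, 0.75, 0.95))
```

The same computation, applied separately to two families of backtests, gives two distributions that can be compared, for instance through their quantiles or through density plots.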

But let us return to the process of computing the backtests. Even if we look only at a single input,
it may take a lot of time to compute results. We may reduce the wall-clock time by distributing the
computations.

Let us run a concrete example: let us see the effect of the rebalancing frequency on the returns of
the momentum strategy. The following code chunk simulates 200 rebalancing schemes: the first 100
rebalance frequently, with 5 to 25 days between rebalancing. The second 100 schemes rebalance less
frequently, with 26 to 300 days between rebalancing. Note that neither scheme includes transaction
fees.


> library("parallel")
> runs <- 100
> args_short <- vector("list", length = runs)
> args_long <- vector("list", length = runs)
> for (i in seq_len(runs)) {
      when <- cumsum(c(250, sample(5:25, 5000, replace = TRUE)))
      when <- when[when <= length(timestamp)]
      args_short[[i]] <- list(prices = list(coredata(P)),
                              signal = signal,
                              do.signal = when,
                              convert.weights = TRUE,
                              initial.cash = 100,
                              k = 10,
                              fun = mom,
                              b = 250,
                              ## tc = 0.0025,
                              timestamp = timestamp,
                              instrument = instrument)

      when <- cumsum(c(250, sample(26:300, 5000, replace = TRUE)))
      when <- when[when <= length(timestamp)]
      args_long[[i]] <- list(prices = list(coredata(P)),
                             signal = signal,
                             do.signal = when,
                             convert.weights = TRUE,
                             initial.cash = 100,
                             k = 10,
                             fun = mom,
                             b = 250,
                             ## tc = 0.0025,
                             timestamp = timestamp,
                             instrument = instrument)
  }
> cl <- makePSOCKcluster(rep("localhost", 4))
> clusterEvalQ(cl, require("PMwR"))


[[1]]
[1] TRUE

[[2]]
[1] TRUE

[[3]]
[1] TRUE

[[4]]
[1] TRUE


> variations1 <- clusterApplyLB(cl, args_short,
                                function(x) do.call(btest, x))
> variations2 <- clusterApplyLB(cl, args_long,
                                function(x) do.call(btest, x))
> stopCluster(cl)
> mom_vars1 <- do.call(merge_series, variations1)
> mom_vars2 <- do.call(merge_series, variations2)


Results are summarized in Figs. 15.23 to 15.25. We get a clear picture: rebalancing often performs
better on average than waiting longer. But note that this is not the final verdict. After all, we
did not consider transaction costs (though for strategies at the level of industries, as we look at
them here, costs will not matter too much, since the industry portfolios are cap-weighted).


FIGURE 15.23 Results from sensitivity check: Fanplots of paths with frequent rebalancing (the upper band) and less
frequent rebalancing (the lower band).

FIGURE 15.24 Results from sensitivity check: The ratio of the median paths of the fanplots in Fig. 15.23. The steadily
rising line indicates that frequent rebalancing outperforms less frequent rebalancing.

FIGURE 15.25 Results from sensitivity checks: densities of annualized returns of the fanplots in Fig. 15.23. The
distribution to the right belongs to the paths with more frequent rebalancing.


15.4.3 Portfolio optimization 


This section has three goals: i) show how a portfolio model, such as one of those discussed in 
Chapter 14, may be backtested with btest; ii) show the sensitivity of such a model to the specific 
settings under which it is run, and iii) show the effects of using a non-exact optimization method (a 
heuristic) instead of an exact one. 

We will discuss a relatively simple model: computing a low-risk portfolio, with risk being the 
variance. The advantage of this model is that it can be solved exactly via quadratic programming. 
Thus, we may compare the exact solution with the one obtained by a heuristic. 


The setup 


Within the signal function there are now two things to do: first, receive historical data and com- 
pute a forecast of the variance—covariance matrix. We collect these computations in a function 
cov_fun. Second, given a variance—covariance matrix, and perhaps other restrictions, compute 
the MV weights. These computations will go into function mv_fun. We may thus write down 
prototypes for both functions: 


> cov_fun <- function(R, ...) {                                [prototypes]
      ## ....
  }
> mv_fun <- function(var, wmin, wmax) {
      ## ....
  }


The reason for separating these computations is that we may want to try different methods for
both computations. The actual signal function is provided in signal_mv. As before, its job is
to fetch the data via Close(), and then call functions cov_fun and mv_fun. One point to
mention is the parameter n: we pick prices every n days and compute returns. Below, we set n
to 20. In this way, we at least mitigate spurious “uncorrelatedness”, which results from even small
asynchronicities in the industry series.¹³


[signal-mv] 
> signal_mv <- function(cov_fun, mv_fun, 
                        wmin, wmax, n, ...) { 

      ## cov_fun .. takes a matrix R of returns 
      ##            (plus ...), and evaluates to 
      ##            the variance--covariance matrix 
      ##            of those returns 
      ## 
      ## mv_fun  .. takes a covariance matrix and 
      ##            min/max weights, and returns 
      ##            minimum-variance weights 

      P <- Close(n = 2500) 

      i <- seq(1, nrow(P), by = n) 
      R <- PMwR::returns(P[i, ]) 
      cv <- cov_fun(R, ...) 
      mv_fun(cv, wmin, wmax) 
  } 


The default mv_fun uses the QP solver solve.QP from the quadprog package. Accordingly, we 
call the function mv_qp. (There is also a function named minvar in the NMOF package, which 
computes the minimum-variance portfolio.) 


[mv-qp] 
> mv_qp <- function(var, wmin, wmax) { 

      na <- dim(var)[1L] 

      ## constraints: sum(w) == 1, w <= wmax, w >= wmin 
      A <- rbind(1, -diag(na), diag(na)) 

      bvec <- rbind(1, 
                    array(-wmax, dim = c(na, 1L)), 
                    array( wmin, dim = c(na, 1L))) 

      quadprog::solve.QP( 
          Dmat = 2*var, 
          dvec = rep(0, na), 
          Amat = t(A), 
          bvec = bvec, 
          meq  = 1L)$solution 
  } 


The mv_qp function is not connected to btest; we may conveniently use it for other cases. 
And we may test it. For instance, let us compute the long-only MV portfolio for the first 5 years 
(approx. 1250 business days) of the data. 


> library ("quadprog") [qp-test] 
> R <- returns(head(P, 1250), period = "month") 

> var <- cov(R) 

> w <- mv_qp(var, wmin = 0, wmax = 0.2) 
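A further plausibility check needs no solver at all. With only the budget constraint (weights sum to one, no weight limits), the minimum-variance weights have a closed form: they are proportional to the inverse covariance matrix times a vector of ones. The following is a minimal sketch in base R under that assumption. The simulated returns R.sim and the helper pvar are hypothetical names introduced here for illustration; they are not part of the backtest. 

```r
## Minimal sketch (base R only): closed-form minimum-variance weights
## when the only constraint is sum(w) == 1.
set.seed(42)
R.sim <- matrix(rnorm(500 * 4, sd = 0.05), ncol = 4)  ## hypothetical returns
R.sim[, 1] <- 0.5 * R.sim[, 1]                        ## asset 1 is less volatile
S <- cov(R.sim)

ones   <- rep(1, ncol(S))
w.free <- solve(S, ones)          ## proportional to S^{-1} 1
w.free <- w.free / sum(w.free)    ## rescale so that weights sum to one

pvar <- function(w)               ## portfolio variance
    drop(t(w) %*% S %*% w)

w.eq <- ones / ncol(S)            ## equal weights, for comparison
c(minvar = pvar(w.free), equal = pvar(w.eq))
```

If the closed-form weights happen to lie within the limits wmin and wmax, a QP solver should reproduce them up to numerical precision. 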


13. You may also notice that we have chosen a fairly long historical window of about 10 years. We have about 50 assets, and 
so we end up with two and a-half times as many observations to compute a variance—covariance matrix. This choice is purely 
for expositional purposes here; practically, we might as well have chosen a shorter window. The trade-off is between bias 
and variance: if we use a shorter window, we will pick up more recent variation more quickly, but at the price of less-stable 
estimates. Choosing a longer window will make estimates more stable, but not necessarily more accurate: they may be more 
biased. Specifically for the dataset here, Fama and French (1997) report that factor loadings are varying strongly over time. 
A longer time horizon will produce more stable numbers, but these numbers are not necessarily better forecasts of future 
variance. 


[bt-mv] 
FIGURE 15.26 Performance of the long-only minimum-variance portfolio (gray) vs market (black). On Nov 19, 1999, 
both series have a value of 100 (the MV time-series starts later because of its burn-in of 10 years). 


As a plausibility check, we look at the 10 sectors with the lowest volatility: they make up more than 
half the portfolio. 


> df <- data.frame(vol = apply(R, 2, sd), 
                   weight = round(100*w, 2)) 
> sum(df[order(df$vol)[1:10], ]) 


[1] 55.6 


Computing the minimum-variance backtest 


We run the backtest. Results are shown in Fig. 15.26. We rebalance at the end of each quarter. 
To compute the covariance, we use R’s cov function. In an actual application, such a choice 
should be a point of criticism. There is ample empirical evidence that the forecasts of variances 
may be improved; for instance, by way of shrinkage methods. Thus, more interesting would be 
to test alternative functions, such as those in packages robustbase (Maechler et al., 2018), 
RiskPortfolios (Ardia et al., 2017) or BurStFin (Burns, 2014). But as we said at the start 
of the section, the main purpose will be to compare an exact method with a heuristic, for given 
inputs. 


> bt.mv <- btest(prices = list(coredata(P)), 
                 signal = signal_mv, 
                 do.signal = "lastofquarter", 
                 convert.weights = TRUE, 
                 initial.cash = 100, 
                 b = 2500, 
                 cov_fun = cov, 
                 mv_fun = mv_qp, 
                 wmin = 0,    ## long-only 
                 wmax = 0.1,  ## assumed: at most 10% per sector, see text 
                 n = 20, 
                 timestamp = timestamp, 
                 instrument = instrument) 


Summary statistics for market and bt.mv may be computed by coercing both to 
NAVseries and calling summary. The summary statistics show that MV had a higher return 
(which is already visible from the figure) together with a lower volatility and a lower drawdown. 
This result is in line with other papers that analyzed low-risk portfolios in the US equity market, 
such as Chan et al. (1999) and Clarke et al. (2006). 

[summaries] 
> print(summary(window(as.NAVseries(market, title = "Market"), 
                       start = as.Date("1999-12-31"))), 
        sparkplot = FALSE, monthly.returns = FALSE) 

Market 
31 Dec 1999 ==> 31 Jul 2018   (4675 data points, 0 NAs) 
    515.676         1522.23 

High                    1540.38  (25 Jul 2018) 
Low                      278.65  (09 Oct 2002) 
Return (%)                  6.0  (annualised) 
Max. drawdown (%)          54.7 
_ peak                   643.82  (09 Oct 2007) 
_ trough                 291.77  (09 Mar 2009) 
_ recovery                       (13 Mar 2012) 
_ underwater now (%)        1.2 
Volatility (%)             14.9  (annualised) 
_ upside                   10.8 
_ downside                 10.4 


> print(summary(window(as.NAVseries(bt.mv, 
                                    title = "Minimum Variance"), 
                       start = as.Date("1999-12-31"))), 
        sparkplot = FALSE, monthly.returns = FALSE) 

Minimum Variance 
31 Dec 1999 ==> 31 Jul 2018   (4675 data points, 0 NAs) 
        100         486.064 

High                     514.90  (26 Jan 2018) 
Low                       86.95  (23 Jul 2002) 
Return (%)                  8.9  (annualised) 
Max. drawdown (%)          42.3 
_ peak                   207.02  (10 Dec 2007) 
_ trough                 119.39  (09 Mar 2009) 
_ recovery                       (04 Apr 2011) 
_ underwater now (%)        5.6 
Volatility (%)             11.6  (annualised) 
_ upside                    9.0 
_ downside                  7.8 

Sensitivity 


As we pointed out, sensitivity analysis is one key job when doing backtests. Many parameters and 
settings could and should be checked for the described backtest. One such setting is the method 
of computing the variance-covariance matrix, and also the way in which the data were prepared. 
We passed a parameter n, so we may check what influence comes from choosing the holding 
period for returns. Of course, the model and its restrictions are important: imposing minimum and 
maximum weights may have much influence on the results. Another parameter is the historical 
window, which we have chosen quite long. That brings the advantage that the computed 
variance-covariance matrix is fairly stable, at the price of not, or only slowly, picking up recent 
changes in the market. 

Let us repeat the sensitivity check we did with the momentum strategy. We rebalance at randomly 
chosen intervals between one week (5 days) and about half a year (125 days). 


[when-rebalance] 
> library("parallel") 
> runs <- 100 
> args_rnd_rebalance <- vector("list", length = runs) 
> for (i in seq_len(runs)) { 
      when <- cumsum(c(2501, sample(5:125, 5000, replace = TRUE))) 
      when <- when[when <= length(timestamp)] 

      args_rnd_rebalance[[i]] <- 
          list(prices = list(coredata(P)), 
               signal = signal_mv, 
               do.signal = when, 
               convert.weights = TRUE, 
               initial.cash = 100, 
               ## the next five settings are assumed 
               ## to equal those of bt.mv 
               cov_fun = cov, 
               mv_fun = mv_qp, 
               wmin = 0, 
               wmax = 0.1, 
               n = 20, 
               b = 2500, 
               timestamp = timestamp, 
               instrument = instrument) 
  } 
> cl <- makePSOCKcluster(rep("localhost", 4)) 
> clusterEvalQ(cl, require("PMwR")) 


[[1]] 
[1] TRUE 

[[2]] 
[1] TRUE 

[[3]] 
[1] TRUE 

[[4]] 
[1] TRUE 


> variations_rnd_rebalance <- 
      clusterApplyLB(cl, args_rnd_rebalance, 
                     function(x) do.call(btest, x)) 
> stopCluster(cl) 
> series_rnd_rebalance <- do.call(merge_series, 
                                  variations_rnd_rebalance) 

As Fig. 15.27 shows, there is quite some variation in the results, which becomes more salient 
when we compute actual numbers. 

FIGURE 15.27 Performance of 100 walk-forwards with random rebalancing periods. The portfolios are computed with 
QP. All randomness in the results comes through differences in the setup; there is no numeric randomness. 


> cat("Annualized volatilities along price paths:\n") 

Annualized volatilities along price paths: 

> summary(16*apply(returns(series_rnd_rebalance), 2, sd)) 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.147   0.148   0.149   0.149   0.149   0.152 


> cat("Annualized returns along price paths:\n") 

Annualized returns along price paths: 

> summary(returns(series_rnd_rebalance, period = "ann")) 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0852  0.0914  0.0959  0.0957  0.0995  0.1068 


We see that realized volatilities vary in a range of half a percentage point, which does not seem 
too much in the equity world. But annual returns vary by more than two percentage points, and 
that is substantial. (And of course, we ran only 100 paths. With more paths, the range would almost 
certainly widen.) This variation, we should stress, appears because of different drifts (realized mean 
returns) of the chosen portfolios; other than that, the portfolios are similar: in Fig. 15.28, we plot the 
daily returns of the first four rebalancing variations. Their correlations are close to 1. That should 
not be surprising, as the actual positions are very similar, as Fig. 15.29 shows for a single asset. 

Let us stress it: the variability in outcomes is entirely caused by variations in our settings; more 
specifically, the choice of when to rebalance. All computations are strictly deterministic; there is 
no chance involved. 


Using Local Search 

We define a second MV function, mv_ls, which uses Local Search. 


[mv-ls] 
> mv_ls <- function(var, wmin, wmax) { 

      na <- dim(var)[1L] 
      if (length(wmin) == 1L) 
          wmin <- rep(wmin, na) 
      if (length(wmax) == 1L) 
          wmax <- rep(wmax, na) 

      .neighbour <- function(w) { 
          stepsize <- runif(1L, min = 0, max = 0.1) 
          toSell <- which(w > wmin) 
          toBuy <- which(w < wmax) 
          i <- toSell[sample.int(length(toSell), size = 1L)] 
          j <- toBuy[sample.int(length(toBuy), size = 1L)] 
          stepsize <- runif(1) * stepsize 
          stepsize <- min(w[i] - wmin[i], wmax[j] - w[j], 
                          stepsize) 
          w[i] <- w[i] - stepsize 
          w[j] <- w[j] + stepsize 
          w 
      } 

      .pvar <- function(w) 
          w %*% var %*% w 

      NMOF::LSopt(.pvar, 
                  list(neighbour = .neighbour, 
                       x0 = rep(1/na, na), 
                       nI = 100000,  ## number of iterations (value assumed) 
                       printBar = FALSE, 
                       printDetail = FALSE))$xbest 
  } 


FIGURE 15.28 Correlation of daily returns when portfolios are rebalanced at random timestamps. 

FIGURE 15.29 Positions in a single industry across four backtest variations. The backtests differ only in their rebalancing 
schedules; hence the positions are very similar. 
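To see what the neighbourhood function does without the NMOF machinery, the search can be written as an explicit loop. The following is a minimal sketch in base R, not the implementation used by LSopt: it accepts a candidate whenever the portfolio variance does not increase. The covariance matrix S, the weight limits, the step size and the number of iterations are all hypothetical choices for illustration. 

```r
## Minimal sketch of a local search for minimum-variance weights.
set.seed(1)
na   <- 5
S    <- cov(matrix(rnorm(200 * na), ncol = na))  ## hypothetical covariance
wmin <- rep(0,   na)
wmax <- rep(0.5, na)

pvar <- function(w)
    drop(w %*% S %*% w)

neighbour <- function(w, maxstep = 0.05) {
    toSell <- which(w > wmin)         ## assets we may reduce
    toBuy  <- which(w < wmax)         ## assets we may increase
    i <- toSell[sample.int(length(toSell), size = 1L)]
    j <- toBuy[sample.int(length(toBuy), size = 1L)]
    ## never step beyond the weight limits
    step <- min(runif(1L, 0, maxstep), w[i] - wmin[i], wmax[j] - w[j])
    w[i] <- w[i] - step
    w[j] <- w[j] + step               ## budget constraint stays intact
    w
}

w  <- rep(1/na, na)                   ## start from equal weights
f0 <- f <- pvar(w)
for (k in seq_len(5000)) {
    wn <- neighbour(w)
    fn <- pvar(wn)
    if (fn <= f) {                    ## accept only non-worsening moves
        w <- wn
        f <- fn
    }
}
c(start = f0, end = f)
```

Because every move shifts weight from one asset to another, the weights always sum to one, and the limits are respected by construction; the objective function needs no penalty terms. 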


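We compare the differences in weights, in basis points. 


> ## limits as described in the text: long-only, 
> ## at most 10% in a single sector 
> w.qp <- mv_qp(var, wmin = 0, wmax = 0.1) 
> w.ls <- mv_ls(var, wmin = 0, wmax = 0.1) 
> summary(10000 * (w.qp - w.ls)) 


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  -2.06    0.00    0.00    0.00    0.00    4.54 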


For the most part, they are very small, with some outliers. (But keep in mind that we may invest up 
to 10% into a single sector, so 5 basis points is not much.) Let us see the impact of these deviations. 
We run 100 backtests with mv_ls. 

FIGURE 15.30 Performance of 100 walk-forwards with fixed rebalancing periods. The portfolios are computed with Local 
Search. All variations in the results stem from the randomness inherent in Local Search (though it is barely visible); there 
are no other elements of chance in the model. Compare with Fig. 15.27. 


[variations-ls] 
> variations_ls <- 
      btest(prices = list(coredata(P)), 
            signal = signal_mv, 
            do.signal = "lastofquarter", 
            convert.weights = TRUE, 
            initial.cash = 100, 
            b = 2500, 
            cov_fun = cov, 
            mv_fun = mv_ls, 
            wmin = 0,    ## long-only 
            wmax = 0.1,  ## assumed: at most 10% per sector, see text 
            n = 20, 
            timestamp = timestamp, 
            instrument = instrument, 
            replications = 100, 
            variations.settings = 
                list(method = "parallel", 
                     cores = 10)) 


The results are shown in Fig. 15.30. There is variation in the paths, but it is so tiny that it is 
barely visible in the figure. 


> series_variations_ls <- do.call(merge_series, variations_ls) 
> cat("Annualized volatilities along price paths:\n") 

Annualized volatilities along price paths: 

> summary(16*apply(returns(series_variations_ls), 2, sd)) 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.146   0.146   0.146   0.146   0.146   0.146 


> cat("Annualized returns along price paths:\n") 

Annualized returns along price paths: 

> summary(returns(series_variations_ls, period = "ann")) 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0881  0.0882  0.0882  0.0882  0.0882  0.0883 


Appendix 15.A Prices in btest 


The prices are passed as argument prices. For a single asset, this must be a matrix of prices 
with four columns: open, high, low and close. For n assets, you need to pass a list of length 
four: prices[[1]] must be a matrix with n columns containing the open prices for the assets; 
prices[[2]] is a matrix with the high prices, and so on. For instance, with two assets, you need 
a list of four matrices with two columns each. 


If only close prices are used, then for a single asset, use either a matrix of one column or a numeric 
vector. For multiple assets a list of length one must be passed, containing a matrix of close prices. 
For example, with 100 close prices of 5 assets, the prices should be arranged in a matrix p, say, of 
size 100 times 5; and prices = list(p). 
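The two layouts can be sketched with made-up numbers. The snippet below only builds the data structures described above; all prices and the names open, high, low, close, prices.ohlc and prices.close are hypothetical, and no call to btest is made. 

```r
## Two hypothetical assets, A and B, observed over five periods.
close <- cbind(A = c(100, 101, 102, 101, 103),
               B = c( 50,  50,  51,  52,  51))
open  <- close - 0.5
high  <- close + 1
low   <- close - 1

## open/high/low/close for several assets: a list of four matrices,
## each with one column per asset
prices.ohlc <- list(open, high, low, close)

## close prices only, several assets: a list of length one
prices.close <- list(close)

str(prices.ohlc)
```

Either object is what one would pass as the prices argument; in the close-only case, the single matrix has one row per period and one column per asset. 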


Appendix 15.B Notes on zoo 
Recall that in the section on momentum, we tested mom with the tail of the dataset. 

> mom.latest <- mom(tail(coredata(P), 250), 10) 

We did this because zoo will match timestamps. This would not have worked: 


> mom(tail(P, 250), 10) 


 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
[26] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 


The matching of timestamps happened when we used /. If you look into Ops.zoo, you will see 
this: 


> merge(e1, e2, all = FALSE, retclass = NULL) 
> NextMethod(.Generic) 


In other words, before zoo does any of the elementary operations, it will merge the operands 
on the time-series, and drop those observations with NA values. In many circumstances this is 
extremely convenient behavior. Suppose for instance you wanted to compute the P/E ratio for a 
stock, and lack data for certain periods. But it may cause surprises at times. A simple example may 
make the behavior clearer. 


[zoo-merge] 
> zoo(1, 1) + zoo(1, 5) 


Data: 
numeric(0) 


Index: 
numeric(0) 


Since we are already at it: another gotcha when using zoo is integer timestamps. 


= zO; 58110) 


= Zoll, 5210) [5] 


= 00 SLO S 


Appendix 15.C Parallel computations in R 

15.C.1 Distributed computing 


In this section, we discuss some of the basics of parallel computations with package parallel. 
We will restrict the discussion to one subset of parallel computation, namely distributed computing, 
and we will stay in the context of backtesting. See also Section 12.B. 

Distributed computing is straightforward: split a large computation into smaller ones, and 
distribute these subcomputations to several workers; then, collect the results from the workers and 
combine them. Perhaps the simplest example is computing the sum of many numbers. Group the 
numbers into several subsets, give these subsets to workers that then compute the subset-sums; and, 
finally, add the subset-sums. The example already makes clear a trade-off: we save time because 
several workers do their job in parallel; but we lose time when we distribute the tasks, and collect 
and combine the results. The effort required for such “administrative” operations is called overhead. 
For computations that take very little time on a modern computer (such as, incidentally, computing 
a sum of numbers), distributing does not help because the overhead is too large. As a rule: the best 
way to find out whether a parallel computation actually saves time (and if so, how much) is to run 
experiments. 

Distributed computing has several advantages: it is simple, and often scales well: if the overhead 
is small compared with the time the actual computation requires, the speedup is basically linear. 
Double the number of workers, and the computation time halves. 

Cleve Moler called easily distributable computations “embarrassingly parallel”; such computations 
are everywhere in finance: 


• Monte Carlo simulations; 
• pricing portfolios in which the position values do not depend on one another; 
• optimization with population-based methods; 
• running restarts for optimization methods; 
• general sensitivity analysis. 
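The sum example from the beginning of this section can be sketched in a few lines. This is only an illustration of the split, distribute and combine pattern, under the assumption that two local worker processes can be started; as noted, for a plain sum the overhead will outweigh any gain. The names x, chunks, partial and total are hypothetical. 

```r
library("parallel")

x <- as.numeric(1:1e6)                 ## the numbers to be summed

## split: two subsets of (almost) equal size
chunks <- split(x, rep(1:2, length.out = length(x)))

## distribute: each worker computes one subset-sum
cl <- makeCluster(2)
partial <- parLapply(cl, chunks, sum)
stopCluster(cl)

## combine: add the subset-sums
total <- sum(unlist(partial))
total == sum(x)
```

Timing both versions with system.time is a quick way to see the overhead for oneself. 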


15.C.2 Loops and apply functions 


Let us define a simple function, one, which gives us what its name promises. 


[one] 
> one <- function(...) 
      1 


Suppose one actually did something useful, and you wanted to repeat this computation 
1000 times. The reflex is to use a loop. 


> runs <- 1000 [one-loop] 
> ones <- numeric(runs) 
> for (i in seq_len(runs) ) 

ones[i] <- one() 


But a loop misguides us, in a way, since it implies an iterative computation: the first, the second, 
etc. But we care not about the order in which one is called. 
In R, we may use lapply instead (or a higher level variant, such as replicate). 


> ones <- lapply(seq_len(runs), one) 


It is not merely a change in syntax: lapply does not make a promise about the order in which 
one is called. 
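One practical difference to the loop is the type of the result: lapply returns a list, whereas the loop filled a numeric vector. The following sketch shows three equivalent ways to obtain the vector; the object names ones.list, ones.vec and ones.rep are hypothetical. 

```r
one  <- function(...) 1
runs <- 1000

ones.list <- lapply(seq_len(runs), one)                ## a list of length 1000
ones.vec  <- vapply(seq_len(runs), one, numeric(1L))   ## a numeric vector
ones.rep  <- replicate(runs, one())                    ## also a numeric vector

identical(unlist(ones.list), ones.vec)
identical(ones.vec, ones.rep)
```

vapply additionally checks that every call returns a single number, which is a cheap safeguard in longer computations. 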

Package parallel offers several parallel equivalents to lapply; one of them is called 
parLapply. Before we use the function, let us make one slower. 


[one-wait] 
> one <- function(...) { 
      Sys.sleep(1)  ## wait one second 
      1 
  } 

Running one four times should now take just as many seconds (plus a little overhead). 


[runs] 
> runs <- 4 
> system.time( 
      for (i in seq_len(runs)) 
          one()) 


   user  system elapsed 
  0.004   0.000   4.009 


The same is true with lapply. 


> system.time( 
      lapply(seq_len(runs), one)) 


   user  system elapsed 
   0.00    0.00    4.01 


But not when run in parallel, on four cores. 

[one-parallel] 


> cl <- makeCluster(4) ## four cores 
> system.time(parLapply(cl, seq_len(runs), one) ) 


user system elapsed 
0.002 0.000 1.004 


> system.time(clusterApply(cl, seq_len(runs), one) ) 


user system elapsed 
0.000 0.001 1.002 


> stopCluster (cl) 


Note that when we call makeCluster, we assume that your machine has four cores. You may 
use function detectCores to find out about your machine. But be sure to read the function’s 
help page and its caveats. 


15.C.3 Distributing data 


Running a computation in parallel may always be split into three parts: distribute data and code to 
the nodes; have the nodes run the computations; and finally collect the results. R will help us with 
the second and the third task; in fact, it will do the whole job for us. That means we are left with 
the task of organizing the computation and distributing it. 

Let us create a new function. 


[sum-xy] 
> sum_xy <- function(x, y) 
      x + y 


Suppose we want to evaluate the function for different values of x and y. These different values are 
collected in a data frame df. 


> df <- expand.grid(x = 1:2, y = 5:6) 
> df 


  x y 
1 1 5 
2 2 5 
3 1 6 
4 2 6 


In a serial computation, i.e. with a loop, we would run through the rows of df and call sum_xy 
for each row. 

In R we may “package” the arguments of a function into a list, and then call the function with 
this single list as an argument. (In case you do not realize it: this is a very powerful feature.) 


[args-do-call] 
> args <- list(x = 1, y = 5) 
> do.call("sum_xy", args) 
[1] 6 


The elements in args will be treated and matched as in a standard function call. So for instance 
args <- list(1, x = 2) means that y gets the value of 1. 
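The matching rule is easy to verify. In the sketch below, show_args is a hypothetical helper that simply returns its arguments by name, so we can see which value ends up where. 

```r
## Hypothetical helper: report which value each argument received.
show_args <- function(x, y)
    c(x = x, y = y)

## named elements are matched first; the unnamed 1 fills the
## remaining argument, which is y
do.call(show_args, list(1, x = 2))

## the same as the ordinary call
show_args(1, x = 2)
```

Both calls return a vector in which x is 2 and y is 1. 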
A simple strategy is the one we described before: we pack every row of df into a list. 

[pack-df] 
> data <- vector("list", length = nrow(df)) 
> for (i in seq_len(nrow(df))) 
      data[[i]] <- list(x = df$x[i], y = df$y[i]) 


Note that in the example here, it would have sufficed to write data[[i]] <- c(df[i, ]) in 
the loop. Calling c has the (documented) side effect of dropping all attributes, including class. 
It is easy to memorize this pattern of setting up the data, because it resembles calling the function 
of interest in a loop, only instead of calling the function, we call list. 
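The packing step can also be written without an explicit loop. The following sketch shows two possible variants, one with lapply over the row numbers and one with Map over the columns; it uses its own small data frame, and the names data.rows, data.map, res.rows and res.map are hypothetical. 

```r
sum_xy <- function(x, y)
    x + y

df <- data.frame(x = c(1, 2, 1, 2),
                 y = c(5, 5, 6, 6))

## variant 1: one list per row
data.rows <- lapply(seq_len(nrow(df)),
                    function(i) as.list(df[i, ]))

## variant 2: walk through the columns in parallel
data.map <- Map(function(x, y) list(x = x, y = y),
                df$x, df$y)

res.rows <- vapply(data.rows, function(z) do.call(sum_xy, z), numeric(1L))
res.map  <- vapply(data.map,  function(z) do.call(sum_xy, z), numeric(1L))
res.rows
```

Whichever variant is used, the result is a list with one element per function call, and each element is itself a list of named arguments. 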
We could now call lapply. 


[eval-lapply] 
> lapply(data, function(z) do.call(sum_xy, z)) 


[[1]] 
[1] 6 

[[2]] 
[1] 7 

[[3]] 
[1] 7 

[[4]] 
[1] 8 


We are almost ready for the parallel version of lapply: we have prepared the data, but the nodes 
are “clean”; they cannot know what sum_xy is. So we tell them, by exporting the function 
sum_xy from the master (i.e. the session that starts the other processes) to the nodes. 


[export-fun] 
> cl <- makeCluster(4) 
> clusterExport(cl, "sum_xy") 
> parLapply(cl, data, function(x) do.call(sum_xy, x)) 


> clusterApply(cl, data, function(x) do.call(sum_xy, x)) 
> stopCluster(cl) 


Alternatively, we could have sent the code of sum_xy as an expression and evaluated it on each 
node. 


[eval-fun] > cl <- makeCluster (4) 
> ignore <- clusterEvalQ(cl, 
sum_xy <- function(x, y) 
x + y) 


> parLapply(cl, data, function(x) do.call(sum_xy, x) ) 


> stopCluster (cl) 


15.C.4 Distributing data, continued 


In the example, all arguments to the computation were variable, i.e. they changed in every call. But 
suppose that y is fixed. In a backtest, think of price data, which may stay the same over different 
tests. Instead of moving such fixed data around every time, we might as well export it from the 
master to the nodes. 


[export-y] 
> y.value <- 100 
> x.values <- as.list(1:4) 
> cl <- makeCluster(4) 
> clusterExport(cl, "y.value") 
> ignore <- clusterEvalQ(cl, 
                         sum_xy <- function(x, y = y.value) 
                             x + y) 
> parLapply(cl, x.values, function(x) sum_xy(x, y.value)) 


[[1]] 
[1] 101 

[[2]] 
[1] 102 

[[3]] 
[1] 103 

[[4]] 
[1] 104 


> stopCluster(cl) 




Let us create two tiny case studies for distributing a backtest. In one, the price data stays 
unchanged, but we wish to test different parameters. In the other, we wish to run the same strategy on 
two different data sets. 


[backtests-parallel1] 
> ## set up data, functions 
> prices <- 101:110 
> signal <- function(threshold) { 
      if (Close() > threshold) 
          1 
      else 
          0 
  } 
> threshold.values <- as.list(102:105) 

> ## create cluster and distribute data, functions 
> cl <- makeCluster(4) 
> clusterExport(cl, c("signal", "prices")) 
> ignore <- clusterEvalQ(cl, 
                         library("PMwR")) 

> ## run btest 
> parLapply(cl, threshold.values, 
            function(x) 
                btest(prices = prices, 
                      signal = signal, 
                      threshold = x)) 


[[1]] 
initial wealth 0 => final wealth 6 

[[2]] 
initial wealth 0 => final wealth 5 

[[3]] 
initial wealth 0 => final wealth 4 

[[4]] 
initial wealth 0 => final wealth 3 


> stopCluster(cl) 


The second example. 


[backtests-parallel2] 
> ## set up data, functions 
> prices <- list(prices1 = 101:110, 
                 prices2 = 201:210) 
> signal <- function() { 
      if (Close() > 105) 
          1 
      else 
          0 
  } 

> cl <- makeCluster(4)        ## create cluster 

> clusterExport(cl,           ## distribute data, functions 
                "signal") 
> ignore <- clusterEvalQ(cl, 
                         library("PMwR")) 

> parLapply(cl, prices,       ## run btest 
            function(x) 
                btest(prices = x, 
                      signal = signal)) 


$prices1 
initial wealth 0 => final wealth 3 

$prices2 
initial wealth 0 => final wealth 8 


> stopCluster(cl) 


As the examples have shown, there is typically more than one way to do it. In general, in particular 
for larger studies, it pays off to take some time to structure the computations and results, and to 
experiment with different setups. 

One useful idea is to store different specifications of backtests in files. If these are R code files, 
we may then use the parallel functions directly with source. Suppose for each strategy you 
have an R file in a directory Backtesting; then you could evaluate each file with a code snippet 
as this one. 


[eval-files] 
> files <- dir("~/Backtesting", 
               pattern = "\\.R$",  ## assumed pattern: all files ending in .R 
               full.names = TRUE) 
> cl <- makeCluster(4) 
> clusterApplyLB(cl = cl, files, source) 
> stopCluster(cl) 


15.C.5 Other functions in the parallel package 


For an overview of the functionality of the parallel package, see the vignette that comes with 
the package. Since the package is recommended, it is usually installed by default. You can open the 
vignette directly from within R. 


> vignette("parallel", package = "parallel") 


15.C.6 Parallel computations in the NMOF package 


Several functions in the NMOF package have support for distributed computations built in. As of 
package version 1.5-0, these are GAopt, gridSearch, bracketing and restartOpt. 

16.1 Term structure models 
16.1.1 Yield curves 


Whenever we read about the yield curve or the term structure of interest rates, chances are that there 
is a diagram that shows the interest rate as a smooth function of time. Such a function is actually not 
observable. Instead, it must be estimated or constructed somehow from what is available: deposits, 
coupon bonds, swaps, and a vast number of other interest-rate products. Sometimes these products 
are quoted in terms of price, as is the case for bonds; sometimes they are quoted in terms of an 
interest rate, as is the case for deposits or interest-rate swaps. The existence of a smooth function 
can be justified by arbitrage. If there were two products with comparable properties such as default 
risk and liquidity but different prices (which are functions of the interest rate), operators would 
exchange one product for the other; thereby the interest rates implied by prices would converge. 


Bond prices and yields 


In this chapter we will explain methods to calibrate yield curve models. For an introduction to the 
mathematics of bonds, yields, and related concepts, see Tuckman (2002) and Luenberger (1998). 
Here we just recall those definitions that we later need. Start with notation: a bond with current 
price $b_0$ pays future cash flows $c$. Subscripts indicate time. Times to payment are called $\tau_1$, $\tau_2$, 
..., and are measured in years; $\tau_0 = 0$ is today. We could have simplified notation by assuming that 
all the $\tau$ are integers. But in practice, they will be fractions of years, and for portfolios they will not 
be equidistant. 

A bond is, for our purposes, a list of payments with associated payment dates. The theoretical 
price $b_0$ of the bond today is the present value of these payments: 

$$ b_0 = d_{\tau_1} c_{\tau_1} + d_{\tau_2} c_{\tau_2} + d_{\tau_3} c_{\tau_3} + \cdots = \sum_i d_{\tau_i} c_{\tau_i} \qquad (16.1) $$

in which the $d_{\tau_i}$ are discount factors. Before the financial crisis 2007–08, we would routinely have 
written $d_{\tau_i} < 1$ for all $i$, i.e. interest rates cannot go below zero. But even when they do, some 
lower bound is probably reasonable, which then implies an upper limit on discount factors. These 
discount factors can be defined as 

$$ d_{\tau_i} = \left(\frac{1}{1+r}\right)^{\tau_i} $$

Numerical Methods and Optimization in Finance. https://doi.org/10.1016/B978-0-12-815065-8.00028-5 
Copyright © 2019 Elsevier Inc. All rights reserved. 487 


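with $r$ an interest rate such as 0.05. For convenience, also define $d_* = \frac{1}{1+r}$. This $d_*$ is constant, 
independent of payment date. We have 

$$
\begin{aligned}
0 &= -b_0 + d_{\tau_1} c_{\tau_1} + d_{\tau_2} c_{\tau_2} + d_{\tau_3} c_{\tau_3} + \cdots && (16.2) \\
  &= \underbrace{-b_0 + d_*^{\tau_1} c_{\tau_1} + d_*^{\tau_2} c_{\tau_2} + d_*^{\tau_3} c_{\tau_3} + \cdots}_{g(d_*)} && (16.3) \\
  &= -b_0 + \left(\frac{1}{1+r}\right)^{\tau_1} c_{\tau_1} + \left(\frac{1}{1+r}\right)^{\tau_2} c_{\tau_2} + \cdots && (16.4)
\end{aligned}
$$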

The $r$ that solves (16.4) is called yield-to-maturity, or internal interest rate. The corresponding $d_*$ 
must be a zero of (16.3). If we know the price and the cash flows of the bond, we can compute this 
zero with one of the techniques discussed in Chapter 11. If all $c$ are greater than zero, there is a nice 
graphical intuition that there is a unique zero. 
Let $g(d_*)$ be the right-hand side of (16.3) as a function 
of $d_*$. For a $d_*$ of zero (i.e. $r$ very large), $g$ is $-b_0$. But as $d_*$ 
grows, the present value of the future cash flows increases, until 
at some point it equals the price of the bond $b_0$, and, hence, 
$g(d_*) = 0$. 


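Let us use Newton’s method to compute this yield-to-maturity. 
The first derivative $g'$ of (16.3) is 

$$ g'(d_*) = \tau_1 d_*^{\tau_1 - 1} c_{\tau_1} + \tau_2 d_*^{\tau_2 - 1} c_{\tau_2} + \tau_3 d_*^{\tau_3 - 1} c_{\tau_3} + \cdots $$

and the updating rule becomes 

$$ d_*^{(k+1)} = d_*^{(k)} - \frac{g\bigl(d_*^{(k)}\bigr)}{g'\bigl(d_*^{(k)}\bigr)} $$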


In the following R script we define a 6-year coupon bond with a yield to maturity of 4.6%. Note 
that we have not taken the trouble to provide the analytic derivative but rather computed it by a 
forward difference. 


[newton-ytm] 
> cf <- c(5, 5, 5, 5, 5, 105)   ## cashflows 
> tm <- 1:6                     ## times to maturity 
> ytm_TRUE <- 0.046             ## the 'true' yield 

> b0 <- sum(cf/(1+ytm_TRUE)^tm) 
> cf <- c(-b0, cf) 
> tm <- c(0, tm) 

> r <- 0.1     ## initial guess for r 
> h <- 1e-6    ## finite-diff step-size 
> dr <- 1      ## change in r 

> while (abs(dr) > 1e-5) { 
      g <- sum(cf/((1+r)^tm)) 
      dg <- (sum(cf/((1+r+h)^tm)) - g)/h 
      dr <- g/dg 
      print(r <- r - dr) 
  } 
[1] 0.0359 
[1] 0.046 
[1] 0.046 


The initial guess is 10%;¹ and after a few iterations we obtain the true yield-to-maturity of 
4.6%. Of course, such a computation should not be run as a script, but belongs in a function. 
In the NMOF package, the computation is implemented in function ytm. It gives the same answer. 
(The package also defines a function vanillaBond that computes the present value of a vector 
of payments, given a set of discount factors.) 


[ytm] 
> ytm(cf, tm) 


[1] 0.046 
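The updating rule for $d_*$ given earlier can also be implemented directly, with the analytic derivative instead of a forward difference. The following is a minimal sketch in base R, not the code of ytm in the NMOF package. The function name ytm_newton, its arguments and its defaults are hypothetical; note that the price b0 is passed separately here, so the cash-flow vector does not start with -b0. 

```r
## Newton's method on g(d) = -b0 + sum(d^tm * cf), with d = 1/(1 + r).
ytm_newton <- function(cf, tm, b0, d = 0.95,
                       tol = 1e-10, maxit = 50L) {
    for (k in seq_len(maxit)) {
        g    <- -b0 + sum(d^tm * cf)        ## equation (16.3)
        dg   <- sum(tm * d^(tm - 1) * cf)   ## analytic first derivative
        step <- g / dg
        d    <- d - step                    ## updating rule
        if (abs(step) < tol)
            break
    }
    1/d - 1                                 ## translate d back into a rate
}

cf <- c(5, 5, 5, 5, 5, 105)    ## cash flows, without the price
tm <- 1:6                      ## times to maturity
b0 <- sum(cf / 1.046^tm)       ## price implied by a yield of 4.6%

ytm_newton(cf, tm, b0)
```

With positive cash flows, g is increasing and convex in d, so after the first step the iterates approach the zero from one side; this is the graphical intuition described above. 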


Practically we will rarely work with a constant $r$, but assume different rates $y_\tau$ for different 
maturities. We will later often write $y(\tau)$ to stress that $y$ is a function of $\tau$. We obtain 

$$ b_0 = \left(\frac{1}{1+y_{\tau_1}}\right)^{\tau_1} c_{\tau_1} + \left(\frac{1}{1+y_{\tau_2}}\right)^{\tau_2} c_{\tau_2} + \cdots $$

We define forward rates $f$ by 

$$ (1 + y_\tau)^\tau = (1 + f_{0,1})(1 + f_{1,2}) \cdots (1 + f_{\tau-1,\tau}). $$

The subscripts to $f$ indicate the start and the end of the period. 
We may as well mention that all the quantities discussed can also be expressed in continuous 
time. Let 

$$ f' = \log(1 + f), \qquad r' = \log(1 + r), \qquad y' = \log(1 + y); $$

then we have 

$$ d_\tau = \exp(-\tau r') $$

and so on. When manipulating equations, continuous time expressions are sometimes easier to 
handle. Nevertheless, in this chapter we stay in discrete time. 
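The relations between zero rates, discount factors and forward rates are quickly checked numerically. The sketch below assumes annual periods and a made-up set of zero rates; the object names tau, y, d, f and y.c are hypothetical. 

```r
tau <- 1:4                              ## maturities in years
y   <- c(0.020, 0.025, 0.030, 0.032)    ## hypothetical zero rates

## discount factors
d <- (1 + y)^(-tau)

## one-period forward rates: f[k] covers the period from k-1 to k;
## the discount factor for today is 1
f <- c(1, d[-length(d)]) / d - 1

## compounding the forward rates recovers the zero rates
all.equal(cumprod(1 + f), (1 + y)^tau)

## continuous-time version of the zero rates
y.c <- log(1 + y)
all.equal(exp(-tau * y.c), d)
```

The first forward rate equals the one-year zero rate, since both refer to the period from today to the end of year one. 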


Constructing yield curves 


When we speak of yield curves in the remainder of this chapter, we mean zero rates $y_\tau$ or their 
associated discount factors and forward rates. (Knowing either quantity for all maturities allows us 
to determine both others.) There are, in a fundamental sense, two ways to transform discrete data 
points to a yield curve, i.e. a function $y(\tau)$ which specifies a $y$ for any $\tau$: we can interpolate the 
available data points, or we can approximate. When we interpolate, we obtain a function $y(\tau)$ which 
exactly prices the instruments that were used to construct the function. When we approximate, then 
the curve $y(\tau)$ prices the instruments only approximately; see Fig. 16.1. In this chapter, we will 
deal with the second approach; still, we will outline the essence of the first approach since it is 
widely applied in practice, and both methods can also be combined. 

FIGURE 16.1 Interpolation and approximation. The panels show a set of market rates. Left: we interpolate the points. 
Right: we approximate the points. 

1. For coupon bonds reasonable starting values are usually the current yield, i.e. coupon divided by bond price, or simply 
the coupon rate. 

Interpolation is based on bootstrapping, which has nothing to do with the resampling technique 
that goes by the same name. The textbook version of bootstrapping goes like this: we have $m$ bonds 
with time to maturity $\tau_j = 1, 2, \ldots, m$; each bond pays a fixed coupon at the end of each period. 
Our aim is now to find discount factors $d$ such that 

$$ b_0^{(j)} = \sum_i d_i c_i^{(j)} $$

for all bonds $j = 1, \ldots, m$. It is helpful to put this problem (and many other problems, too) into 
matrix form. We can rely on intuition and results from linear algebra; and sometimes we can directly 
implement the model in matrix form and so make use of fast available algorithms. In the textbook 
version of bootstrapping, we obtain the system of equations 


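$$
\underbrace{\begin{bmatrix}
c_1^{(1)} & & & & \\
c_1^{(2)} & c_2^{(2)} & & & \\
c_1^{(3)} & c_2^{(3)} & c_3^{(3)} & & \\
\vdots & \vdots & \vdots & \ddots & \\
c_1^{(m)} & c_2^{(m)} & c_3^{(m)} & \cdots & c_m^{(m)}
\end{bmatrix}}_{C}
\underbrace{\begin{bmatrix} d_1 \\ d_2 \\ d_3 \\ \vdots \\ d_m \end{bmatrix}}_{d}
=
\underbrace{\begin{bmatrix} b_0^{(1)} \\ b_0^{(2)} \\ b_0^{(3)} \\ \vdots \\ b_0^{(m)} \end{bmatrix}}_{b}
\tag{16.5}
$$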


which we can solve by forward substitution (see Chapter 3). The $j$th row in the matrix $C$ corresponds 
to the cash flows of the $j$th bond with price $b_0^{(j)}$. The computed discount factors (i.e. the 
vector d that solves the equations) will “interpolate” the bond prices; thus, the model prices will 
exactly equal the market prices. 
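The textbook case can be tried out in a few lines. The following is only a sketch: the coupons and prices are made-up numbers, not data from this chapter, and we assume annual compounding when we convert discount factors into zero rates.

```r
## Hypothetical data: three bonds with annual coupons of 2, 3, and 4
## (per 100 face value), maturing in 1, 2, and 3 years.
C <- rbind(c(102,   0,   0),
           c(  3, 103,   0),
           c(  4,   4, 104))
b <- c(100.5, 101.2, 102.8)   ## assumed market prices

## C is lower triangular, so forward substitution applies
d <- forwardsolve(C, b)

## zero rates implied by the discount factors (annual compounding)
y <- d^(-1 / (1:3)) - 1

## largest pricing error: zero up to rounding
err <- max(abs(C %*% d - b))
```

Because every row introduces exactly one new unknown, the discount factors are obtained one after the other, which is where the name bootstrapping comes from.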

But in practice we never have this nice case. C will not be square. We may have more dates than 
bonds (more columns than rows), and hence, potentially, an infinity of solutions. Or we may have 
more bonds than dates (more rows than columns); then in general no solution exists. So, C will 
not have a nice triangular structure. As an example, Table 16.1 shows a list of German government 
bonds as of the end of May 2010. These bonds are included in NMOF as the data set bundData. 

Putting the cash flows of these bonds into a matrix results in the plot shown in Fig. 16.2. The 
plot was created with the Matrix package (Bates and Maechler, 2018). Try: 


> library("Matrix")
> (M <- Matrix(c(1, 0, 0, 0, 0, 1), 3, 2))

3 x 2 sparse Matrix of class "dgCMatrix"

[1,] 1 .
[2,] . .
[3,] . 1

> image(M)


The call to image results in a plot like in Fig. 16.2. Such plotting devices are helpful to 
recognize structure in a matrix. 

The data set has more payment dates than bonds, that is, there are more columns than rows. 
This form of C is the practically relevant case for Germany; see Jaschke et al. (2000). 

From a purely numerical point of view, we need to solve 


Cd=b (16.6) 


for d, under the constraints 
TABLE 16.1 A sample data set of German government bonds. Prices as of 31 May 2010. 


ISIN coupon maturity dirty price 
DE0001135150 5.25 2010-07-04 105.225 
DE0001141471 2.5 2010-10-08 102.448 
DE0001135168 5.25 2011-01-04 105.173 
DE0001141489 3.3 2011-04-08 103.282 
DE0001135184 5 2011-07-04 109.642 
DE0001141497 3.5 2011-10-14 106.555 
DE0001135192 5 2012-01-04 109.396 
DE0001141505 4 2012-04-13 107.248 
DE0001135200 5 2012-07-04 113.852 
DE0001141513 4.25 2012-10-12 111.383 
DE0001135218 4.5 2013-01-04 111.627 
DE0001141521 3.5 2013-04-12 108.469 
DE0001135234 3.75 2013-07-04 112.241 
DE0001141539 4 2013-10-11 112.864 
DE0001135242 4.25 2014-01-04 112.945 
DE0001141547 2 2014-04-11 104.821 
DE0001135259 4.25 2014-07-04 115.747 
DE0001141554 2.5 2014-10-10 106.672 
DE0001135267 3.75 2015-01-04 111.571 
DE0001141562 2.5 2015-02-27 105.405 
DE0001141570 2.25 2015-04-10 103.547 
DE0001135283 3.25 2015-07-04 110.815 
DE0001135291 3.5 2016-01-04 110.589 
DE0001134468 6 2016-06-20 128.904 
DE0001135309 4 2016-07-04 115.669 
DE0001134492 5.625 2016-09-20 125.13 
DE0001135317 3.75 2017-01-04 112.071 
DE0001135333 4.25 2017-07-04 117.547 
DE0001135341 4 2018-01-04 113.343 
DE0001135358 4.25 2018-07-04 WAVY 
DE0001135374 3.75 2019-01-04 111.231 
DE0001135382 3.5 2019-07-04 111.235 
DE0001135390 3.25 2020-01-04 107.140 
DE0001135408 3 2020-07-04 103.161 
DE0001134922 6.25 2024-01-04 138.951 
DE0001135044 6.5 2027-07-04 148.880 
DE0001135069 5.625 2028-01-04 133.666 
DE0001135085 4.75 2028-07-04 124.534 
DE0001135143 6.25 2030-01-04 144.801 
DE0001135176 5.5 2031-01-04 133.995 
DE0001135226 4.75 2034-07-04 126.884 
DE0001135275 4 2037-01-04 112.663 
DE0001135325 4.25 2039-07-04 120.167 
DE0001135366 4.75 2040-07-04 130.134 


A typical value for $d_{\max}$ used to be 1, which implies that interest rates cannot become negative. In 
any case, since a solution may not exist or may not be unique, we may have to settle for 


$$\operatorname*{argmin}_{d} \; \|Cd - b\| \tag{16.7}$$


where ||.|| is a specified norm. 

There are several ways to obtain d from Eq. (16.7). We could try a purely numerical approach 
and minimize the residuals. If a solution exists for an underidentified system as we have here, it will 
not be unique, so we may need to add further constraints; typically, procedures to solve the system 
will also minimize a norm of the solution. In practice, with an underidentified system, people often 
prefer to add identifying assumptions that are economically motivated. For example, suppose there 
FIGURE 16.2 The matrix C for the cash flows of the 44 bonds in bundData; each square represents a nonzero entry. 
Each row gives the cash flows of one bond, each column is associated with one payment date. 


are two bonds that pay coupons annually. The first bond expires in 0.5 years, the second in 1.25 
years. So we have two prices $b_0^{(1)}$ and $b_0^{(2)}$, but three payment dates: 0.25, 0.5, and 1.25 years. We 
could assume now that the very short end of the curve is flat below our shortest zero bond; that 
is, we assume that the rates for 0.25 and 0.5 years are the same. Now we are left with only two 
unknowns. 
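A small numerical sketch of this identifying assumption follows. The coupons and dirty prices are hypothetical, and we assume annual compounding, so that a flat zero rate $y$ gives the discount factor $(1+y)^{-\tau}$.

```r
## Hypothetical numbers: both bonds pay an annual coupon per 100 face value.
c1 <- 4; b1 <- 103.0    ## bond 1: matures in 0.5 years
c2 <- 3; b2 <- 104.5    ## bond 2: matures in 1.25 years, coupon due in 0.25

## bond 1 has a single cash flow, which gives d(0.5) directly
d050 <- b1 / (100 + c1)

## flat short end: the same zero rate applies for 0.25 and 0.5 years,
## so d(0.25) = (1 + y)^(-0.25) = sqrt(d(0.5))
y_short <- d050^(-1 / 0.5) - 1
d025 <- (1 + y_short)^(-0.25)

## bond 2 now has only one remaining unknown, d(1.25)
d125 <- (b2 - c2 * d025) / (100 + c2)
```

With the assumption in place, two prices determine two unknowns, $y$ for the short end and the discount factor for 1.25 years.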

With an overidentified system, i.e. more bonds than payment dates, we can accept an approxi- 
mate solution for which we minimize the residuals of Eq. (16.6), or we could throw away rows, i.e. 
use only a subset of the available instruments. 
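For the overidentified case, a minimal sketch with made-up numbers (four bonds, three payment dates) shows the approximate solution under the 2-norm. We use qr.solve from base R, which returns the Least-Squares solution when the matrix has more rows than columns.

```r
## Hypothetical example: four bonds, but only three payment dates
C <- rbind(c(102,   0,   0),
           c(  3, 103,   0),
           c(  4,   4, 104),
           c(  5,   5, 105))
b <- c(100.5, 101.2, 102.8, 105.9)   ## assumed market prices

## Least-Squares solution: minimizes the 2-norm of C d - b
d <- qr.solve(C, b)

## pricing errors; in general they are not zero
res <- drop(C %*% d - b)
```

The residuals are orthogonal to the columns of $C$, which is the defining property of the Least-Squares solution; the model prices no longer match the market prices exactly.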

No matter how we decide to solve this problem, we will end up with a discrete set of discount 
factors (those in the vector d) for specified maturities. To price cash flows at other dates, we have 
to interpolate these zero rates, or fit an approximating function through the data points. The latter 
approach is the one we describe in the next sections. We will discuss a widely used model, that of 
Nelson and Siegel (1987). 


16.1.2 The Nelson-Siegel model 


The model of Nelson and Siegel (1987) and its extension by Svensson (1994) are used by many op- 
erators, in particular at central banks, as a model for the term structure of interest rates (Bank 
for International Settlements (BIS), 2005). Unfortunately, model calibration, that is, obtaining 
parameter values such that model yields accord with market yields, is difficult. Various authors 
have reported “numerical difficulties” when working with the model; for instance, Bolder and 
Stréliski (1999), Cairns (1998), Gurkaynak et al. (2006), and De Pooter (2007). In this section, 
we analyze the calibration of the model in more detail. We show that the problem is twofold: 
first, the optimization problem is not convex and has multiple local optima. Hence methods that 
are readily available in statistical packages—in particular, methods based on derivatives of the 
objective function—are not appropriate to obtain parameter values. We will use an optimiza- 
tion heuristic, Differential Evolution, to obtain parameters. Second, in certain ranges of the pa- 
rameters, the model is badly conditioned, thus estimated parameters are unstable given small 
perturbations of the data. We discuss to what extent these difficulties affect applications of the 
model. 

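Nelson and Siegel (1987) suggested modeling the yield curve at a point in time as follows: let 
$y(\tau)$ be the zero rate for maturity $\tau$, then 


$$
y(\tau) = \beta_1
 + \beta_2 \left[\frac{1-\exp(-\tau/\lambda_1)}{\tau/\lambda_1}\right]
 + \beta_3 \left[\frac{1-\exp(-\tau/\lambda_1)}{\tau/\lambda_1} - \exp(-\tau/\lambda_1)\right].
\tag{16.8}
$$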


Thus, for a given cross-section of yields, we need to fix four parameters: $\beta_1$, $\beta_2$, $\beta_3$, and $\lambda_1$. We do 
not assume that the model’s parameters are constant; they can change over time. But in this chapter 
we are only interested in the cross-section; hence, to simplify notation, we do not add subscripts 
for the time period. 
In the Nelson–Siegel (NS) model, the yield $y$ for a particular maturity is the sum of several 
components. $\beta_1$ is independent of time to maturity, and so it is often interpreted as the long-run 
yield level. $\beta_2$ is weighted by a function of time to maturity. This function is unity for $\tau = 0$ and 
exponentially decays to zero as $\tau$ grows; hence, the influence of $\beta_2$ is only felt at the short end of 
the curve. $\beta_3$ is also weighted by a function of $\tau$, but this function is zero for $\tau = 0$, increases, and 
then falls back to zero as $\tau$ grows. It thus adds a hump to the curve. The parameter $\lambda_1$ affects the 
weight functions for $\beta_2$ and $\beta_3$; in particular it determines the position of the hump. An example is 
shown in Figs. 16.3-16.5. 
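The limiting behavior of the components is easy to check numerically. The function ns_yield below is a plain re-implementation of Eq. (16.8) written for this sketch (the NMOF package provides its own function NS); the parameter values are those stated in the captions of Figs. 16.3-16.5.

```r
## Nelson-Siegel yields; param = c(beta1, beta2, beta3, lambda1).
## (Illustrative re-implementation; not the NMOF function.)
ns_yield <- function(param, tm) {
    aux <- tm / param[4]
    g <- (1 - exp(-aux)) / aux
    param[1] + param[2] * g + param[3] * (g - exp(-aux))
}

param <- c(3, -2, 6, 2)

## very short maturity: the yield approaches beta1 + beta2
y_short <- ns_yield(param, 1e-6)

## very long maturity: the yield approaches beta1
y_long <- ns_yield(param, 1e4)

## in between, the hump term lifts the curve above the long-run level
y_mid <- ns_yield(param, 5)
```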

The parameters of the model thus have, to some extent, a direct (observable) interpretation, 
which brings about constraints. For instance, to impose non-negativity, we require $\beta_1 > 0$ (long-term 
rate needs to be positive), and $\beta_1 + \beta_2 > 0$ (short-term rate needs to be positive). We may 
set other lower bounds as well. We also require that $\lambda_1 > 0$. In the examples we show later, we 
will always add non-negativity constraints, primarily to show how such constraints might be implemented. 
They may then easily be updated for cases with negative rates. Two conventions regarding 
the model: first, we define parameters in units of percentage points, so $\beta_1 = 2$ means 2%; second, 
we measure all maturities in years. 

The Nelson–Siegel–Svensson (NSS) model adds a second hump term (see Fig. 16.5) to the NS 
model. Let $y(\tau)$ be the zero rate for maturity $\tau$ again, then 

$$
y(\tau) = \beta_1
 + \beta_2 \left[\frac{1-\exp(-\tau/\lambda_1)}{\tau/\lambda_1}\right]
 + \beta_3 \left[\frac{1-\exp(-\tau/\lambda_1)}{\tau/\lambda_1} - \exp(-\tau/\lambda_1)\right]
 + \beta_4 \left[\frac{1-\exp(-\tau/\lambda_2)}{\tau/\lambda_2} - \exp(-\tau/\lambda_2)\right].
\tag{16.9}
$$

Now we need to estimate six parameters: $\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$, $\lambda_1$, and $\lambda_2$. Constraints on the parameters 
remain the same; we also have $\lambda_2 > 0$. Given a set of parameters, the forward rates in the NS model 
are given by 

$$
f(\tau) = \beta_1 + \beta_2 \exp(-\tau/\lambda_1) + \beta_3 \, \frac{\tau}{\lambda_1} \exp(-\tau/\lambda_1),
$$

and for the NSS case we get 

$$
f(\tau) = \beta_1 + \beta_2 \exp(-\tau/\lambda_1) + \beta_3 \, \frac{\tau}{\lambda_1} \exp(-\tau/\lambda_1) + \beta_4 \, \frac{\tau}{\lambda_2} \exp(-\tau/\lambda_2).
$$

FIGURE 16.3 Level. The left panel shows $y(\tau) = \beta_1 = 3$. The right panel shows the corresponding yield curve, in this 
case also $y(\tau) = \beta_1 = 3$. The influence of $\beta_1$ is constant for all $\tau$. 

FIGURE 16.4 Short-end shift. The left panel shows $y(\tau) = \beta_2 \left[\frac{1-\exp(-\tau/\lambda_1)}{\tau/\lambda_1}\right]$ for $\beta_2 = -2$. The right panel shows the 
yield curve resulting from the effects of $\beta_1$ and $\beta_2$, that is, $y(\tau) = \beta_1 + \beta_2 \left[\frac{1-\exp(-\tau/\lambda_1)}{\tau/\lambda_1}\right]$ for $\beta_1 = 3$, $\beta_2 = -2$. The short 
end is shifted down by 2%, but then the curve grows back to the long-run level of 3%. 

FIGURE 16.5 Hump. The left panel shows $\beta_3 \left[\frac{1-\exp(-\tau/\lambda_1)}{\tau/\lambda_1} - \exp(-\tau/\lambda_1)\right]$ for $\beta_3 = 6$. The right panel shows the yield 
curve resulting from all three components. In all panels, $\lambda_1$ is 2. 


There exist many variants of this model (De Pooter, 2007), with the simplest idea being to add more 
humps. But depending on the purpose of using the model, this may not be a good idea, as we show 
next. 


Collinearity 


A first step after having implemented a model such as Eqs. (16.8) and (16.9) is to “play around” 
with it. This has at least two benefits: we gain intuition about the model and—since we only just 
implemented it—we may spot errors in our code. In fact, we can investigate one unfortunate prop- 
erty of the Nelson—Siegel model (and its variants) without any data, simply by “playing around” 
with it. 

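Suppose we have obtained zero yields $y^{\mathrm{M}}$ for $m$ different maturities $\tau_1, \ldots, \tau_m$. The superscript 
M in $y^{\mathrm{M}}$ indicates that these are market rates. Now we wish to calibrate an NS model to these 
yields. Fixing the $\lambda_1$-value, we have $m$ linear equations from which to estimate three parameters: 


$$
\begin{bmatrix}
1 & \frac{1-\exp(-\tau_1/\lambda_1)}{\tau_1/\lambda_1} & \frac{1-\exp(-\tau_1/\lambda_1)}{\tau_1/\lambda_1} - \exp(-\tau_1/\lambda_1) \\
1 & \frac{1-\exp(-\tau_2/\lambda_1)}{\tau_2/\lambda_1} & \frac{1-\exp(-\tau_2/\lambda_1)}{\tau_2/\lambda_1} - \exp(-\tau_2/\lambda_1) \\
1 & \frac{1-\exp(-\tau_3/\lambda_1)}{\tau_3/\lambda_1} & \frac{1-\exp(-\tau_3/\lambda_1)}{\tau_3/\lambda_1} - \exp(-\tau_3/\lambda_1) \\
\vdots & \vdots & \vdots \\
1 & \frac{1-\exp(-\tau_m/\lambda_1)}{\tau_m/\lambda_1} & \frac{1-\exp(-\tau_m/\lambda_1)}{\tau_m/\lambda_1} - \exp(-\tau_m/\lambda_1)
\end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix}
=
\begin{bmatrix} y^{\mathrm{M}}(\tau_1) \\ y^{\mathrm{M}}(\tau_2) \\ y^{\mathrm{M}}(\tau_3) \\ \vdots \\ y^{\mathrm{M}}(\tau_m) \end{bmatrix}
\tag{16.10}
$$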


We need to solve these equations for $\beta$. 
We can interpret the NSS model analogously, just now we have to fix two parameters, $\lambda_1$ and 
$\lambda_2$. Then we have a fourth regressor 

$$
\left[\frac{1-\exp(-\tau_i/\lambda_2)}{\tau_i/\lambda_2} - \exp(-\tau_i/\lambda_2)\right],
$$

and can proceed as before. This system of equations is overidentified for the practical case m > 3 
(or m > 4 for NSS), so we need to minimize a norm of the residuals. Below we will discuss the 
Least-Squares case, i.e. we use the 2-norm. We do not stress a specific solution technique, but are 
interested in the conditioning of the regressor matrix. 

We can compute the regressors of this linear model with the functions NSf and NSSf, provided 
by the NMOF package. 


> NSf

function(lambda, tm) {
    aux <- tm/lambda
    X <- array(1, dim = c(length(tm), 3L))
    X[, 2L] <- (1 - exp(-aux))/aux
    X[, 3L] <- ((1 - exp(-aux))/aux) - exp(-aux)
    X
}
<environment: namespace:NMOF>


> NSSf

function(lambda1, lambda2, tm) {
    aux1 <- tm/lambda1
    aux2 <- tm/lambda2
    X <- array(1, dim = c(length(tm), 4L))
    X[, 2L] <- (1 - exp(-aux1))/aux1
    X[, 3L] <- (1 - exp(-aux1))/aux1 - exp(-aux1)
    X[, 4L] <- (1 - exp(-aux2))/aux2 - exp(-aux2)
    X
}
<environment: namespace:NMOF>


Note that the resulting matrix will not depend on the observed rates $y^{\mathrm{M}}$, only on the maturities that 
we use and the $\lambda$-values. Some authors suggest linking $\lambda_1$ to the longest time to maturity in 
the sample; specifically, it should not be larger than the longest maturity; see Bolder and Stréliski 
(1999, p. 49), Manousopoulos and Michalopoulos (2009, p. 596), Wets and Bianchi (2006, p. 57). 
Suppose that we have bonds with a maturity up to ten years, and we fix $\lambda_1$ at 6. Then the call 


> cor(NSf(lambda = 6, tm = 1:10))

     [,1]   [,2]   [,3]
[1,]    1     NA     NA
[2,]   NA  1.000 -0.975
[3,]   NA -0.975  1.000


reveals a correlation of —0.97 between the second and the third regressor. If we use the NSS model, 
we see similarly high correlations. 


> cor(NSSf(lambda1 = 1, lambda2 = 5, tm = 1:10))

     [,1]   [,2]   [,3]   [,4]
[1,]    1     NA     NA     NA
[2,]   NA  1.000  0.845 -0.996
[3,]   NA  0.845  1.000 -0.871
[4,]   NA -0.996 -0.871  1.000


(A correlation between a constant and a varying quantity is not defined; hence the NA values.) 
Fig. 16.6 shows the correlations for different values of $\lambda_1$ and $\lambda_2$. Over wide ranges of the 
$\lambda$-parameters the correlations are either $-1$ or 1. 

Practically we may have 20 to 40 maturities, so 20 to 40 equations. We should feel uncomfortable 
whenever we estimate a linear regression model with 20 to 40 observations when the regressors 
are so highly correlated. The result of such correlation is an identification problem: we cannot 
accurately compute the regression coefficients anymore. But a bad conditioning of the regressor matrix does not 
imply large residuals (see Section 2.5). In other words, we may well fit our model to market rates 
and so have small residuals; but the parameters themselves are not reliable. We show an example 
of the effects of collinearity in the next section, when we discuss the calibration of the model. 
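Besides correlations, the condition number of the regressor matrix is a convenient summary of the problem. The sketch below re-implements the NS regressors in a helper ns_regressors (a name chosen for this illustration; it mirrors NSf but is not the NMOF function) and compares three $\lambda_1$-values for maturities of one to ten years.

```r
## NS regressors for a fixed lambda (illustrative re-implementation)
ns_regressors <- function(lambda, tm) {
    aux <- tm / lambda
    g <- (1 - exp(-aux)) / aux
    cbind(1, g, g - exp(-aux))
}

tm <- 1:10
lambdas <- c(1, 6, 10)

## correlation between the second and the third regressor
rho <- sapply(lambdas, function(l) {
    X <- ns_regressors(l, tm)
    cor(X[, 2], X[, 3])
})

## 2-norm condition number of the regressor matrix
cond <- sapply(lambdas, function(l)
    kappa(ns_regressors(l, tm), exact = TRUE))
```

A large condition number means that small changes in the yields can lead to large changes in the estimated $\beta$-values, even though the fit itself may remain good.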


FIGURE 16.6 Nelson–Siegel–Svensson: The three panels show the correlation between the second and the third, the second 
and the fourth, and the third and the fourth regressors in Eq. (16.10) for different $\lambda$-values (the x- and y-axes show $\lambda_1$ 
and $\lambda_2$ between 0 and 25). 


16.1.3 Calibration strategies 


In this section we discuss numerical approaches to obtain parameter estimates for the NS/NSS 
model. 


Case 1: Linear regression with bootstrapped zero rates 


Aim: to find $\beta$-parameters for the NS/NSS model such that $y(\tau_i)$ is close to $y^{\mathrm{M}}(\tau_i)$ for all $i$, given 
fixed $\lambda$-values. 


Suppose we have obtained a set of zero rates from a bootstrapping procedure, and now wish to find 
the NS or NSS parameters that approximate this set of rates. There is a simple strategy to obtain 
parameters for both the NS and NSS model: fix the $\lambda$-parameters, and then estimate the $\beta$-values 
with Least Squares (Nelson and Siegel, 1987, p. 478); see Eq. (16.10). This estimation approach 
is easily implemented. In the following code we set up a true yield curve and then estimate the 
parameters with R’s lm function. The functions NS and NSS are included in the NMOF package. 
The fixed $\lambda_1$ in the regression does not equal the true $\lambda_1$, yet the errors are small. 


> tm <- 1:10
> paramTRUE <- c(4, -2, 2, 1)
> yM <- NS(paramTRUE, tm)
> do.call(par, par.nmof)
> plot(tm, yM,
       xlab = "maturities in years",
       ylab = "yields in %-points",
       type = "b")

[Plot: yields in %-points against maturities in years.]


> lambda <- 1.5 ## fix lambda and run regression
> result <- lm(yM ~ -1 + NSf(lambda, tm))
> ## compare results
> do.call(par, par.nmof)
> plot(yM - result$fitted.values,
       xlab = "maturities in years",
       ylab = "errors in %-points",
       type = "h")
> abline(h = 0)

[Plot: errors in %-points against maturities in years.]


This approach has a number of disadvantages, though. We need to use the Least-Squares crite- 
rion to measure goodness-of-fit; inequality constraints are more difficult to add; and we still have 
to do a line search (NS) or a grid search (NSS) for good A-values. But the Least-Squares approach 
allows us to demonstrate the effect of collinearity in a context of estimating a linear regression. 
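The line search over $\lambda_1$ mentioned above can be sketched as follows. The grid of candidate values is an assumption made for this example, ns_regressors is an illustrative re-implementation of the NS regressors (not the NMOF function NSf), and the yields are noiseless, so the search should recover the true value.

```r
ns_regressors <- function(lambda, tm) {
    aux <- tm / lambda
    g <- (1 - exp(-aux)) / aux
    cbind(1, g, g - exp(-aux))
}

tm <- 1:10
## noiseless 'market' yields from beta = (4, -2, 2) and lambda1 = 1
yM <- drop(ns_regressors(1, tm) %*% c(4, -2, 2))

## assumed search grid for lambda1
lambdas <- seq(0.5, 10, by = 0.25)

## sum of squared residuals of the Least-Squares fit for each lambda
ssr <- sapply(lambdas, function(l)
    sum(.lm.fit(ns_regressors(l, tm), yM)$residuals^2))

lambda_best <- lambdas[which.min(ssr)]
beta_best <- .lm.fit(ns_regressors(lambda_best, tm), yM)$coefficients
```

For NSS, the same idea requires a two-dimensional grid over $\lambda_1$ and $\lambda_2$, with one regression per grid point.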

We first define true NS parameters, namely c(4,-2,2) for the $\beta$, and 1 for $\lambda_1$, as before. 


> lambda <- 1
> betaTRUE <- c(4, -2, 2, lambda)
> tm <- 1:10


With these we compute $y^{\mathrm{M}}$ from Eq. (16.8). Then we use lm to estimate the parameters. Unsurprisingly, 
we get exactly the true parameters. 


> yM <- NS(betaTRUE, tm)
> lm(yM ~ NSf(lambda, tm) - 1)

Call:
lm(formula = yM ~ NSf(lambda, tm) - 1)

Coefficients:
NSf(lambda, tm)1  NSf(lambda, tm)2  NSf(lambda, tm)3
               4                -2                 2


Now we add a bit of random noise, order of magnitude 1 basis point (bp), to the $y^{\mathrm{M}}$, and then 
redo our estimation. We repeat this 1000 times, as shown in the next script. Note that we have not 
used the function lm but .lm.fit, which is much faster. (See Appendix 3.A.) We then repeat the 
whole procedure a second time, but now with a $\lambda_1$ of 10. 


> trials <- 1000
> yM_lambda_10 <- yM_lambda_1 <-
      array(NA, dim = c(trials, 3))
> colnames(yM_lambda_10) <- colnames(yM_lambda_1) <-
      c("beta_1", "beta_2", "beta_3")
> lambda <- 1
> betaTRUE <- c(4, -2, 2, lambda)
> tm <- 1:10
> for (t in seq_len(trials)) {
      yM <- NS(betaTRUE, tm) +
          rnorm(length(tm), sd = 0.01)
      yM_lambda_1[t, ] <-
          .lm.fit(NSf(lambda, tm), yM)$coefficients
  }
> lambda <- 10
> betaTRUE <- c(4, -2, 2, lambda)
> for (t in seq_len(trials)) {
      yM <- NS(betaTRUE, tm) +
          rnorm(length(tm), sd = 0.01)
      yM_lambda_10[t, ] <-
          .lm.fit(NSf(lambda, tm), yM)$coefficients
  }

FIGURE 16.7 Distributions of estimated parameters. The true parameters are 4, −2, and 2. 


Fig. 16.7 shows estimates for the parameters when $\lambda_1$ is fixed at 1 (the gray lines), and 10 (the 
black lines). 


> for (i in 1:3) {
      plot(ecdf(yM_lambda_10[, i]),
           xlab = "",
           ylab = "",
           main = substitute(beta[i], list(i = i)))
      abline(v = betaTRUE[i], col = grey(0.6))
      lines(ecdf(yM_lambda_1[, i]),
            xlab = "",
            ylab = "",
            main = substitute(beta[i], list(i = i)),
            col = grey(0.4))
  }


While the magnitude of the noise was the same in both cases (about 1 bp), we see massive 
differences in the distributions of the parameters. For a $\lambda_1$ of 1, we are always close to the true 
parameters; for a $\lambda_1$ of 10, we are sometimes way off. (Check the correlations between the regressors.) 
This is not necessarily a problem if we want to fit the curve, i.e. if we want $\|y^{\mathrm{M}} - y\|_2$ or 
another norm to be small. But it is a problem if we want to work with the parameters (e.g. for 
forecasts, such as in Diebold and Li, 2006), or want to interpret them. Do not expect these errors to 
be smaller in the NSS model, or in more-complex variants. We need to stress that these large errors 
did not result from an inappropriate method; they solely resulted from our choice of $\lambda$. The only 
remedy is that if we want to identify the parameters, we need to constrain the $\lambda$-values such that 
the resulting correlations remain reasonable. This also applies to the cases that we discuss next. 


Case 2: Fitting bootstrapped zero rates with Differential Evolution 
Aim: to find parameters for the NS/NSS model such that $y(\tau_i)$ is close to $y^{\mathrm{M}}(\tau_i)$ for all $i$. 


For Least Squares we had to fix $\lambda$. Now we want to determine all parameters in one run. For 
this, we need another optimization technique, a heuristic. Why would we not use the Least-Squares 
approach? Finding a good $\lambda$ requires a grid search, but Least Squares is so fast that this is not 
really a hindrance. (Some variants of NSS require more parameters, however.) The advantage of a 
heuristic is that it is much more flexible. First, we can write down any objective function, not just 
a sum of squares. Second, we can add further constraints. Let us give an example. Suppose we require 
that $\beta_1 > 0$ and $\beta_1 + \beta_2 > 0$ to have nonnegative interest rates. Yet even when these constraints are not 
violated, we may have negative interest rates, as the next code example shows. See Fig. 16.8. 

FIGURE 16.8 Negative interest rates with Nelson–Siegel model despite parameter restrictions. 


> tm <- seq(1, 10, length.out = 30) ## 1 to 10 years
> paramTRUE <- c(3, -2, -8, 1.5)    ## 'true' parameters
> yM <- NS(paramTRUE, tm)
> do.call(par, par.nmof)
> plot(tm, yM,
       xlab = "maturities in years",
       ylab = "yields in percent")
> abline(h = 0)


The constraints on the parameters are not violated, yet the parameters imply negative interest 
rates. (Conversely, for a finite set of maturities, we may find positive interest rates even though the 
$\beta$ parameters violate their constraints.) A more direct way would be to require that $y(\tau) > 0$ for all 
$\tau > 0$. This is a nonlinear constraint, but poses no problem for a heuristic. Some testing showed 
that it is usually not necessary to include such a constraint (you may find the implementation in 
Schumann, 2011–2018a). But importantly, if we decide to leave it out, it is because we deem it not 
necessary, not because our optimization technique cannot handle it. 
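The following sketch checks both kinds of constraints for the parameters used in Fig. 16.8. ns_yield is again an illustrative re-implementation of Eq. (16.8), and the variable shortfall is a hypothetical measure that we introduce here to show how a violation of $y(\tau) > 0$ could be quantified on a finite set of maturities.

```r
## Nelson-Siegel yields; param = c(beta1, beta2, beta3, lambda1)
ns_yield <- function(param, tm) {
    aux <- tm / param[4]
    g <- (1 - exp(-aux)) / aux
    param[1] + param[2] * g + param[3] * (g - exp(-aux))
}

param <- c(3, -2, -8, 1.5)
tm <- seq(1, 10, length.out = 30)

## the constraints on the parameters hold ...
ok_params <- param[1] > 0 && param[1] + param[2] > 0 && param[4] > 0

## ... but some model yields are negative
y <- ns_yield(param, tm)
any_negative <- any(y < 0)

## sum of the negative parts of the yields; zero if all yields are positive
shortfall <- sum(pmax(-y, 0))
```

Such a measure is only evaluated at the chosen maturities; it says nothing about yields between or beyond them.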

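The formal optimization model to solve is: 


$$\min_{\beta,\,\lambda} \; \|y - y^{\mathrm{M}}\|$$


subject to 

$$
\begin{aligned}
& y(\tau) \text{ given by }
  \begin{cases}
  \text{Eq. (16.8) for NS,} \\
  \text{Eq. (16.9) for NSS,}
  \end{cases} \\
& \beta_1 > 0, \\
& \beta_1 + \beta_2 > 0, \\
& \lambda_i > 0 \quad (i = 1 \text{ for NS},\ i = 1, 2 \text{ for NSS}).
\end{aligned}
$$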


Note that $\|y - y^{\mathrm{M}}\|$ could be any norm (not necessarily the 2-norm) or distance function. To 
solve this model, we turn to a heuristic method, Differential Evolution (DE). The algorithm for DE 
was given on page 281; we repeat it for convenience in Algorithm 63. 

An implementation is provided in function DEopt., in which the final . serves as a reminder 
that it is an abbreviated version of function DEopt, as contained in the NMOF package. 


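Algorithm 63 Differential Evolution for yield curve models. 
 1: set $n_P$, $n_G$, F and CR 
 2: randomly generate initial population $P^{(1)}_{j,i}$, $j = 1, \ldots, d$, $i = 1, \ldots, n_P$ 
 3: for $k = 1$ to $n_G$ do 
 4:     $P^{(0)} = P^{(1)}$ 
 5:     for $i = 1$ to $n_P$ do 
 6:         randomly generate $\ell_1, \ell_2, \ell_3 \in \{1, \ldots, n_P\}$, $\ell_1 \neq \ell_2 \neq \ell_3 \neq i$ 
 7:         compute $P^{(v)}_{\cdot,i} = P^{(0)}_{\cdot,\ell_1} + \mathrm{F} \times (P^{(0)}_{\cdot,\ell_2} - P^{(0)}_{\cdot,\ell_3})$ 
 8:         for $j = 1$ to $d$ do 
 9:             if rand < CR then $P^{(u)}_{j,i} = P^{(v)}_{j,i}$ else $P^{(u)}_{j,i} = P^{(0)}_{j,i}$ 
10:         end for 
11:         if $f(P^{(u)}_{\cdot,i}) < f(P^{(0)}_{\cdot,i})$ then $P^{(1)}_{\cdot,i} = P^{(u)}_{\cdot,i}$ else $P^{(1)}_{\cdot,i} = P^{(0)}_{\cdot,i}$ 
12:     end for 
13: end for 
14: return best solution 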


> DEopt. <- function(OF, algo = list(), ...) {
      d <- length(algo$max)
      nP <- algo$nP
      vFv <- vF <- rep(NA, nP)
      shift <- c(nP, seq_len(nP - 1L))

      ## initial population: one solution per column
      mP <- runif(d*nP) * (algo$max - algo$min) + algo$min
      dim(mP) <- c(d, nP)

      for (s in seq_len(nP))
          vF[s] <- OF(mP[, s], ...)
      for (g in seq_len(algo$nG)) {
          ## three shifted permutations select the solutions to combine
          vI <- sample.int(nP)
          R1 <- vI[shift]
          R2 <- R1[shift]
          R3 <- R2[shift]
          mPv <- mP[, R1] + algo$F * (mP[, R2] - mP[, R3])
          ## crossover: keep the old element where rand > CR
          mI <- runif(d * nP) > algo$CR
          mPv[mI] <- mP[mI]
          for (s in seq_len(nP))
              vFv[s] <- OF(mPv[, s], ...)
          ## keep a new solution only if it is better
          is.better <- vFv < vF
          mP[, is.better] <- mPv[, is.better]
          vF[is.better] <- vFv[is.better]
      }
      list(xbest = mP[, which.min(vF)[1L]],
           OFvalue = min(vF))
  }


The function DEopt. is a (relatively) straightforward translation of the given pseudocode. To 
see the complete function, attach package NMOF, and type DEopt. 
The function DEopt is called with three arguments: 


> DEopt(OF, algo, ...)


The ... argument allows us to pass several variables, but in this chapter we will always collect all additional 
variables in a single list Data. So we will always call DEopt in this way: 


> DEopt(OF, algo, Data)

Doing so makes for a clearer structure and is less error prone in our experience, because within 
a function we always have to explicitly ask for the variables within Data. So altogether we need 
three objects: OF, algo, and Data. 


OF is the objective function. 
algo is a list that holds the settings of the DE algorithm, as discussed below. 
Data is a list that holds the pieces of data necessary to evaluate the objective function. 


When it has finished, DEopt will return a list of several components; among them: 


xbest is the solution; the parameter vector with the lowest objective function value. 

OFvalue is the objective function value associated with xbest. 

popF is the vector with the final objective function values of all population members (i.e. OFvalue 
equals min(popF)). 

Fmat is a matrix of size $n_G \times n_P$. It holds the objective function values of all solutions over time. 


Implementing an optimization technique always requires decisions on various parameters and settings. 
If we use optim in R or fminunc in MATLAB®, we may not be aware that these parameters 
are there, but they are. They have been set, but not necessarily ideally for all problems. This is not 
a case of a flawed implementation. Optimization always requires testing, evaluation, and checking. 
For this, the Analyst should have an idea of what certain parameters do. So specifically for DE, we 
need to think about the following settings, most of which are passed with the list algo. 


Initial population The initial solutions are drawn from a uniform distribution over specified 
ranges. These ranges need to be fixed through the vectors algo$min and algo$max. The 
ranges serve two purposes: they tell the algorithm how many elements a given solution has 
(length(algo$min)), and over what range to initialize the population. The ranges are not 
constraints (at least not per default: with the option algo$minmaxConstr set to TRUE, 
DEopt will treat them as constraints). Not specifying algo$min and algo$max will produce 
an error. 

Step size (F) Recall that DE computes a new solution by adding a weighted difference between 
two existing solutions to a third one. F is the weight, normally set between 0 and 1. Smaller 
values mean that a given vector is changed by smaller increments; hence, we make smaller 
steps when we move through the solution space. 

Probability of crossover (CR) A number between 0 and 1. A high value means that a new solution 
is changed along many dimensions; a low value indicates that a change affects only few 
elements of a solution. 

Number of generations $n_G$ (nG) The number of generations. 

Population size $n_P$ (nP) The number of solutions in the population. 

Stopping criterion We use a simple stopping criterion: a fixed number of function evaluations, 
given by the product $n_G \times n_P$. We can often reduce computing time by adding a break if the 
population has converged; that is, when all solutions are similar (in terms of parameters, or 
objective function value). 
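A convergence check of the kind mentioned under the stopping criterion could look as follows. The helper converged and its tolerance are hypothetical; they are not part of DEopt. or of the NMOF package, and the check looks only at objective function values, not at the parameters.

```r
## Hypothetical helper: TRUE if the objective function values of all
## solutions in the population differ by less than 'tol'.
converged <- function(vF, tol = 1e-8)
    diff(range(vF)) < tol

## two toy vectors of objective function values
vF_similar   <- c(0.25, 0.25 + 1e-10, 0.25)
vF_different <- c(0.25, 0.40, 1.30)

stop_early_1 <- converged(vF_similar)     ## TRUE
stop_early_2 <- converged(vF_different)   ## FALSE
```

Inside the loop over the generations, one could then add a line such as `if (converged(vF)) break` after the population has been updated.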


We also need to think how to implement the constraints. But before we do so, let us discuss the 
algorithm so far. 

The way we initialize the population in DEopt. may require a brief explanation. We want to 
create a matrix of nP columns, and each column should be “between” min and max, which we 
may do efficiently by using R’s recycling rule. The following toy example illustrates how it works. 


> min <- 1:3
> max <- 5:7
> nP <- 6 ## population size
> d <- length(min)
> M <- (max - min)*runif(d*nP) + min
> dim(M) <- c(d, nP)
> M

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,] 4.28 4.61 2.46 3.72 3.34 3.16
[2,] 5.32 4.917 3.55 2.33 5.39 2.72
[3,] 3.29 6.45 4.38 6.08 3.39 5.93


Algorithm 63 contains three nested loops. In both inner loops, however, iteration i+1 does not 
depend on iteration i. Hence the two inner loops can in principle be vectorized. This is intuitive: in 
these loops we create new solutions and evaluate them, but there is no need to do this in a specific 
order. For expensive objective functions, we could even distribute the evaluation. 

To create new solutions, we first compute auxiliary solutions $P^{(v)}$ as follows: 


$$P^{(v)}_{\cdot,i} = P^{(0)}_{\cdot,\ell_1} + \mathrm{F} \times \left(P^{(0)}_{\cdot,\ell_2} - P^{(0)}_{\cdot,\ell_3}\right).$$


This vector addition/subtraction is executed for all $n_P$ solutions in the population. Let $\pi$ be a random 
permutation of the numbers from 1 to $n_P$, and $\pi_i$ be the $i$th number in this vector. Let $\pi'$ and $\pi''$ 
be two further permutations such that $\pi_i \neq \pi'_i$, $\pi_i \neq \pi''_i$, and $\pi'_i \neq \pi''_i$ for all $i$. This is easily 
implemented by randomly drawing one permutation $\pi$ and then shifting the elements in the vector, 
for instance, $\pi'_i = \pi_{i-1}$ with $\pi_0 = \pi_{n_P}$. In MATLAB, the function circshift does just that; in R 
we could write our own function. 


> shift <- function(x) {
      n <- length(x)
      c(x[n], x[seq_len(n - 1L)])
  }
> shift(1:5)

[1] 5 1 2 3 4

> shift(shift(1:5))

[1] 4 5 1 2 3


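A more general function shift is available in the magic package (Hankin, 2005), available from 
CRAN. Now we can write the instruction to create $P^{(v)}$ as: 


$$
P^{(v)} = \left[ P^{(0)}_{\cdot,\pi_1} \; P^{(0)}_{\cdot,\pi_2} \; \cdots \right]
 + \mathrm{F} \times \left( \left[ P^{(0)}_{\cdot,\pi'_1} \; P^{(0)}_{\cdot,\pi'_2} \; \cdots \right]
 - \left[ P^{(0)}_{\cdot,\pi''_1} \; P^{(0)}_{\cdot,\pi''_2} \; \cdots \right] \right).
$$


(If you look into DEopt. or DEopt you will find that the code for shift is actually inlined.) 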
In any case, we compute all auxiliary solutions in one step by adding and subtracting matrices 
instead of vectors. This is much faster than looping through the columns of P. The crossover in 
line 9 of Algorithm 63 can also be vectorized; if not for speed, then for clarity. 
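The matrix version can be compared with the column-by-column version on a toy population. This is only a sketch: the population is random, the settings are arbitrary, and the seed is fixed merely to make the example reproducible. We call the step size stepF to avoid masking R's shorthand F.

```r
set.seed(42)
d <- 4; nP <- 6; stepF <- 0.5; CR <- 0.9
mP <- matrix(runif(d * nP), nrow = d)     ## one solution per column

## three permutations obtained by shifting a single random permutation
shift <- c(nP, seq_len(nP - 1L))
R1 <- sample.int(nP)[shift]
R2 <- R1[shift]
R3 <- R2[shift]

## all auxiliary solutions in one step
mPv <- mP[, R1] + stepF * (mP[, R2] - mP[, R3])

## the same computation, one column at a time
mPv_loop <- mP
for (i in seq_len(nP))
    mPv_loop[, i] <- mP[, R1[i]] + stepF * (mP[, R2[i]] - mP[, R3[i]])

## vectorized crossover: keep the old element where rand > CR
mI <- runif(d * nP) > CR
mPu <- mPv
mPu[mI] <- mP[mI]
```

Both versions give the same matrix; the vectorized one simply replaces the loop over columns by indexing with the permutation vectors.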

Now that we have new solutions, we need to evaluate them. In a given iteration, the simplest 
approach is to loop over the solutions, and to compute the objective function for each solution: 
> for (s in seq_len(nP))
      vFv[s] <- OF(mPv[, s], ...)


Alternatively, we could have done this with 

> vFv <- apply(mPv, 2, OF, ...)


but this rarely makes the code faster, since it essentially still loops over the solutions; see Chambers 
(2008). We can often accelerate the computation by truly evaluating the whole population in one 
step, that is, by truly evaluating it in a “vectorized way.” While the code becomes faster, it also 
becomes more memory-hungry and sometimes less straightforward—and in any case, it is not 
always possible to vectorize. But if it can be done, it is often worth the effort because the computing 
time can be substantially reduced. In the complete code of DEopt, we find the following structure: 


> if (algo$loopOF) {
      for (s in seq_len(nP))
          vFv[s] <- OF(mPv[, s], ...)
  } else {
      vFv <- OF(mPv, ...)
  }


Thus, the parameter algo$loopOF determines whether R loops over the columns of mPv, or 
whether we pass the whole population to the objective function. In the latter case, of course, we 
need to write OF differently. Normally, we would call the objective function as 


> OF(solution, ...)


in which solution is a vector, i.e. a single solution. OF would then return a scalar, the objective 
function value of solution. If we can vectorize the evaluation, we need to write the function 
such that solution is the whole matrix P of solutions. The function then must return a vector 
of objective function values, which correspond to the columns of P. For the problem here, we will 
do this only for the constraints, discussed below. For examples of vectorized objective functions, 
see page 422 and page 530. The NMOF package also has a vignette on the topic, which you can open
with the following command: 


> ## to open a list of vignettes in the browser, say 
> ## browseVignettes ("NMOF" ) 


> vignette("vectorise", package = "NMOF") 
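To make the difference between the two calling conventions concrete, here is a minimal sketch for the Nelson-Siegel case. The function ns below is a hand-written stand-in for the package's NS function, and the names OF_single and OF_vectorized are made up for this illustration; none of this is taken from DEopt itself.

```r
## Nelson-Siegel yields for a single parameter vector
## (beta1, beta2, beta3, lambda1); a stand-in for NMOF's NS()
ns <- function(param, tm) {
    aux <- tm / param[4]
    g <- (1 - exp(-aux)) / aux
    param[1] + param[2] * g + param[3] * (g - exp(-aux))
}

## looped evaluation: 'param' is one solution, the result is a scalar
OF_single <- function(param, Data)
    max(abs(ns(param, Data$tm) - Data$yM))

## vectorized evaluation: 'mP' stores one solution per column,
## the result has one element per column
OF_vectorized <- function(mP, Data) {
    aux <- outer(Data$tm, mP[4, ], "/")   # length(tm) x nP
    ea <- exp(-aux)
    g <- (1 - ea) / aux
    ## t(g) is nP x length(tm), so the rows of mP recycle correctly
    y <- t(mP[1, ] + mP[2, ] * t(g) + mP[3, ] * t(g - ea))
    apply(abs(y - Data$yM), 2, max)
}

tm <- c(c(1, 3, 6, 9)/12, 1:10)
Data <- list(tm = tm, yM = ns(c(6, 3, 8, 1), tm))
P <- cbind(c(6, 3, 8, 1), c(5, 2, 7, 2), c(4, 1, 9, 3))

res_loop <- apply(P, 2, OF_single, Data = Data)  # one call per column
res_vec  <- OF_vectorized(P, Data)               # one call in total
```

Both versions should return the same three values; the vectorized one builds a whole matrix of yields at once, which is where the extra memory goes.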


But back to the example. Let us look at the objective function. We evaluate a particular vec- 
tor param through the objective function OF. We compute model rates y for the given tm and 
parameters param, and return the maximum absolute difference between y and yM. 


> OF <- function(param, Data) {
      y <- Data$model(param, Data$tm)
      maxdiff <- y - Data$yM
      maxdiff <- max(abs(maxdiff))
      if (is.na(maxdiff))
          maxdiff <- 1e10
      maxdiff
  }


OF takes two arguments, a solution param (a vector of parameters) and further information 
collected in Data. The list Data comprises, among other things, the times-to-payment (tm) and 
the market rates (yM), both numeric vectors; and the model (e.g., NS), which is a function. The 
model expects as inputs param and tm. 


> Data <- list(yM = yM,
               tm = tm,
               model = NS,
               pen.w = 0.1,
               min = c( 0, -15, -30,  0),
               max = c(15,  30,  30, 10))


Finally, we want to enforce constraints. Constraints can be incorporated in DEopt either through
repair mechanisms or through penalties. In this section, we choose the latter option. The most
straightforward way is to compute the penalty directly in the objective function; any required
parameters can be passed with Data. Suppose, for instance, we wanted to include the constraint
y > 0. An objective function might look like OF.pen, in which Data$pen.w is a weight.


> OF.pen <- function(param, Data) {
      y <- Data$model(param, Data$tm)
      res <- max(abs(y - Data$yM))

      ## compute the penalty
      aux <- y - abs(y)  # aux is zero for positive y
      aux <- -sum(aux) * Data$pen.w
      res + aux
  }

Alternatively, we may factor out the penalty computation into its own function. 


> penalty1 <- function(param, Data) {
      y <- Data$model(param, Data$tm)
      neg.y <- abs(y - abs(y))  ## equiv. to '-2*pmin(y, 0)'
      sum(neg.y) * Data$pen.w
  }


Having a separate penalty function comes with a performance cost, since we have to evaluate 
model twice; but it is negligible in this case. DEopt also offers a separate call to a penalty function 
(and a repair function, too). This is helpful, again, in cases when we can vectorize only a particular 
computation (e.g., only the penalty evaluation, but not the objective function). 
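A small numerical check may help to see what the penalty term does. The vector y below is invented for illustration; the sketch only confirms that the expression abs(y - abs(y)) is zero for non-negative elements and twice the absolute value for negative ones, as the comment in penalty1 states.

```r
y <- c(2.5, 0.3, -0.2, -1.5, 0)   # made-up yields, two of them negative
pen.w <- 0.1                      # weight, as in Data$pen.w

pen_a <- abs(y - abs(y))          # form used in penalty1
pen_b <- -2 * pmin(y, 0)          # equivalent form

total <- sum(pen_a) * pen.w       # 0.1 * (0.4 + 3.0)
```

Only the negative elements contribute, and they do so in proportion to the size of the violation, so a solution that is only slightly infeasible is penalized only slightly.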

To use this separate call, algo holds two functions, repair and penalty (both default to 
NULL). These functions, like the objective function, are called as 


> fun(solution, ...)


The repair function (if specified) is applied in every generation to new solutions before the 
objective function is computed. It returns a repaired.solution. penalty is applied after 
the objective function was called; it returns zero if no constraint is violated, or a positive number 
in case the particular solution violates a restriction. This penalty is added to the objective function 
value. 

As with the objective function, if these computations can be vectorized, they will usually be
faster. By default, both functions take one solution (a vector) as their argument, and they are
always passed the ... argument. repair will then return a single repaired solution; penalty
will return a scalar. Yet if we set


> algo$loopRepair <- FALSE
> algo$loopPen <- FALSE


then, as with the objective function, we pass the whole population. repair then needs to return 
the repaired population, and penalty needs to return a vector. The penalty function serves as an 
example. We can compute the penalties in one step for the whole population. 


> penalty <- function(mP, Data) {
      minV <- Data$min
      maxV <- Data$max
      pen.w <- Data$pen.w

      ## if larger than maxV, element in A is positive
      A <- mP - as.vector(maxV)
      A <- A + abs(A)

      ## if smaller than minV, element in B is positive
      B <- as.vector(minV) - mP
      B <- B + abs(B)

      ## beta 1 + beta2 > 0
      C <- pen.w * ((mP[1, ] + mP[2, ]) -
                    abs(mP[1, ] + mP[2, ]))
      A <- pen.w * colSums(A + B) - C
      A
  }


This function handles the minimum and maximum constraints, and the constraint that β₁ + β₂ > 0.²
The penalty function is passed with the algo list, along with the setting algo$loopPen = FALSE.
An example: Suppose we have a population of three solutions.


> param1 <- c( 6, 3, 8, -1)  ## invalid: lambda1 < 0
> param2 <- c( 6, 3, 8,  1)
> param3 <- c(-1, 3, 8,  1)  ## invalid: beta1 < 0
> P <- cbind(param1, param2, param3)
> rownames(P) <- c("beta1", "beta2", "beta3", "lambda1")
> P
        param1 param2 param3
beta1        6      6     -1
beta2        3      3      3
beta3        8      8      8
lambda1     -1      1      1


We pass the whole population P to penalty, 


> penalty(P, Data)
param1 param2 param3
   0.2    0.0    0.2


and we receive a vector of penalties. 


A weight parameter pen.w, already added to Data, controls how heavily we penalize.


> Data$pen.w <- 0.5
> penalty(P, Data)
param1 param2 param3
     1      0      1


For valid solutions, the penalty should be zero.


2. It can be helpful to scale the penalty to the magnitude of the objective function. This should be intuitive: a penalty of
0.02 will not help much if the current value of the objective function is of a magnitude of 15,000, say.


> param1 <- c( 5, 3, 8, 1)  ## three valid solutions
> param2 <- c( 6, 3, 8, 1)
> param3 <- c( 7, 3, 8, 1)
> P <- cbind(param1, param2, param3)
> rownames(P) <- c("beta1", "beta2", "beta3", "lambda1")
> penalty(P, Data)
param1 param2 param3
     0      0      0


Let us go through the complete example. We also show that the code can be used with only
minimal changes for the NSS model. (These examples are included in a vignette in the NMOF package.)


> tm <- c(c(1, 3, 6, 9)/12, 1:10)
> betaTRUE <- c(6, 3, 8, 1)
> yM <- NS(betaTRUE, tm)

> OF <- function(param, Data) {
      y <- Data$model(param, Data$tm)
      maxdiff <- y - Data$yM
      maxdiff <- max(abs(maxdiff))
      if (is.na(maxdiff))
          maxdiff <- 1e10
      maxdiff
  }
> Data <- list(yM = yM,
               tm = tm,
               model = NS,
               pen.w = 0.1,
               min = c( 0, -15, -30,  0),
               max = c(15,  30,  30, 10))
> algo <- list(
      nP = 100L,
      nG = 500L,
      F = 0.50,
      CR = 0.99,
      min = c( 0, -15, -30,  0),
      max = c(15,  30,  30, 10),
      pen = penalty,
      repair = NULL,
      loopOF = TRUE,
      loopPen = FALSE,
      loopRepair = TRUE,
      printBar = FALSE)

> system.time(DEopt(OF = OF, algo = algo, Data = Data))


Differential Evolution. 
Best solution has objective function value 0 ; 
standard deviation of OF in final population is 3.89e-16 


user system elapsed 
0.381 0.000 0.381 


FIGURE 16.9 Results for Case 2, Nelson–Siegel (NS) model: True model yields, yield curves fitted with DEopt, and
yield curves fitted with nlminb. Plotted are the results of 5 restarts for both methods, though for DEopt it is impossible
to distinguish between the different curves.


We plot the true curve and the fitted curves in Fig. 16.9. As a benchmark, we also plot the
curve as fitted with nlminb. This is not a fair test: nlminb is not appropriate for such problems.
(But then, if we found that it performed better than DE, we would have a strong indication
that something is wrong with our implementation of DE.) We use a random starting value s0.

It is important to stress that the results of both functions (DEopt and nlminb) are stochastic:
in the case of DE because it deliberately uses randomness; in the case of nlminb because we fixed
the starting value randomly. To get more meaningful results we should run both algorithms several
times. We will show below how this can be done easily with the function restartOpt; but here
we use a simple loop. Note that it may look as if there is only one curve for DE, simply because all
solutions exactly fit the true yields.


> n.runs <- 5L
> do.call(par, par.nmof)
> plot(tm, yM,
       xlab = "Maturities in years",
       ylab = "yields in %")
> algo$printDetail <- FALSE
> for (i in seq_len(n.runs)) {
      sol <- DEopt(OF = OF, algo = algo, Data = Data)
      lines(tm, Data$model(sol$xbest, tm), col = grey(0.3))

      s0 <- algo$min + (algo$max - algo$min) * runif(length(algo$min))
      sol2 <- nlminb(s0, OF, Data = Data,
                     lower = Data$min,
                     upper = Data$max,
                     control = list(eval.max = 50000L,
                                    iter.max = 50000L))
      lines(tm, Data$model(sol2$par, tm),
            col = grey(0.5), lty = 3)
  }
> legend(x = "topright",
         legend = c("true yields", "DEopt", "nlminb"),
         col = c("black", grey(0.3), grey(0.5)),
         pch = c(1, NA, NA),
         lty = c(0, 1, 3),
         y.intersp = 0.75, box.lty = 0)


We run the model one more time and do some more checks. 


> sol <- DEopt(OF = OF, algo = algo, Data = Data)

> ## max. error and objective function value of solution
> ## ==> should be the same
> all.equal(max(abs(Data$model(sol$xbest, Data$tm) -
                    Data$model(betaTRUE, Data$tm))),
            sol$OFvalue)

[1] TRUE

> ## test: nlminb with random starting value s0
> s0 <- algo$min + (algo$max - algo$min) * runif(length(algo$min))
> sol2 <- nlminb(s0, OF, Data = Data,
                 lower = Data$min,
                 upper = Data$max,
                 control = list(eval.max = 50000L,
                                iter.max = 50000L))

> ## max. error and objective function value of solution
> ## ==> should be the same
> max(abs(Data$model(sol2$par, tm) -
          Data$model(betaTRUE, tm)))

[1] 0.489

> sol2$objective

[1] 0.489

For completeness, we adapt the example for the NSS model. Results are shown in Fig. 16.10. 


> ## set up yield curve (here: artificial data), and plot it
> tm <- c(c(1, 3, 6, 9)/12, 1:10)
> betaTRUE <- c(5, -2, 5, -5, 1, 6)
> yM <- NSS(betaTRUE, tm)

> Data <- list(## collect everything in Data
      yM = yM,
      tm = tm,
      model = NSS,
      min = c( 0, -15, -30, -30, 0,  5),
      max = c(15,  30,  30,  30, 5, 10),
      pen.w = 1)

> algo <- list(
      nP = 100L,
      nG = 500L,
      F = 0.50,
      CR = 0.99,
      min = c( 0, -15, -30, -30, 0,  5),
      max = c(15,  30,  30,  30, 5, 10),
      pen = penalty,
      repair = NULL,
      loopOF = TRUE,
      loopPen = FALSE,
      loopRepair = TRUE,
      printBar = FALSE,
      printDetail = TRUE)

> sol.DE <- DEopt(OF = OF, algo = algo, Data = Data)

Differential Evolution.
Best solution has objective function value 1.24e-14 ;
standard deviation of OF in final population is 1.14e-14


FIGURE 16.10 Results for Case 2, Nelson–Siegel–Svensson (NSS) model: True model yields, yield curves fitted with
DEopt, and yield curves fitted with nlminb. Plotted are the results of 5 restarts for both methods, though for DEopt it
is impossible to distinguish between the different curves. Compare with Fig. 16.9: if anything, the results for nlminb are
worse.


Case 3: Fitting prices with Differential Evolution 


In the two previous examples, we assumed that we already had a vector of zero rates, obtained, for
instance, from a bootstrapping procedure. Now we shall calibrate the model parameters directly to
a vector bᴹ of bond prices. As in yM, the superscript stands for “market.”


Aim: to find parameters for the NS/NSS models such that theoretical bond prices b, computed from
y(τ), are close to observed prices bᴹ.


The path from a set of parameters of the NS/NSS model to theoretical bond prices is straightforward: 


1: For given parameters, compute yield curve from NS/NSS model; 
2: with this yield curve, calculate theoretical bond prices; 
3: compute discrepancy between theoretical bond prices and observed bond prices. 


So our problem is 


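$$\min_{\beta,\lambda} \; \lVert b - b^{\mathrm{M}} \rVert$$

subject to

$$b = \begin{bmatrix} \left(\dfrac{1}{1+y(\tau_1)}\right)^{\tau_1} & \left(\dfrac{1}{1+y(\tau_2)}\right)^{\tau_2} & \cdots \end{bmatrix} C'$$

$$y(\tau) = \beta_1 + \beta_2 \left[\frac{1-\exp(-\tau/\lambda_1)}{\tau/\lambda_1}\right] + \cdots \qquad \text{(Eq. (16.8) for NS, Eq. (16.9) for NSS)}$$

$$\beta_1 > 0$$

$$\beta_1 + \beta_2 > 0$$

$$\lambda_i > 0 \quad (i = 1 \text{ for NS},\; i = 1, 2 \text{ for NSS}).$$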


As noted before, a bond is a collection of dates and associated payments. So, in R a bond comprises
a vector cf of payments, such as [4.25, 4.25, 104.25], and a vector tm of times-to-payment, such
as [1, 2, 3]. Using actual dates, we have something like the following snippet:


> cf <- c(4.25, 4.25, 104.25)
> mats <- c("2010-10-12", "2011-10-12", "2012-10-12")


> ## compute time to maturity in years 
> today <- as.Date("2010-05-31") 
> tm <- as.numeric((as.Date(mats) - today) )/365 


For given zero rates y corresponding to the payment dates, we can price the bond by computing a
vector of discount factors, and then computing the inner product of these discount factors and the
cash flow vector. This we will have to do for all bonds in our data set, and many times over. Let us
first organize our data.
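For a single bond, the pricing step just described can be sketched as follows. The zero rates in y are invented for illustration (quoted in percent, with annual compounding as in the rest of this section); they are not market data.

```r
cf <- c(4.25, 4.25, 104.25)   # cash flows of the bond
tm <- c(1, 2, 3)              # times to payment in years
y  <- c(2.0, 2.5, 3.0)        # hypothetical zero rates in percent

df <- 1 / (1 + y/100)^tm      # one discount factor per payment
b  <- sum(df * cf)            # inner product: the bond's price
```

The same two lines, applied to a matrix of cash flows instead of a vector, are what the objective function for this case has to evaluate for every candidate solution.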


The package NMOF includes a data set bundData, which contains information on 44 German
government bonds. bundData is a list with three components:

cfList is a list of length 44 with cash flow vectors for the bonds. Each element of the list is a
numeric vector.

tmList is a list of length 44 with payment dates associated with cfList. Each element of the
list is a character vector, with dates formatted according to ISO 8601 as YYYY-MM-DD. Thus
we can easily coerce them to the Date class with as.Date. The dates are the contractual
payment dates. If such a date is a weekend or a holiday, the payment will usually be made
on the next business day. Hence, cash flows may occur later than stated, which makes such
a bond slightly less attractive than when evaluated at contractual payment dates. In practice,
people usually correct for non-business days.

bM is a vector of market prices as of 31 May 2010. These are dirty prices, so they include accrued
interest.


It is convenient to store all cash flows in a single matrix, as in Eq. (16.5). 


> cfList <- bundData$cfList
> tmList <- bundData$tmList
> mats <- unlist(tmList, use.names = FALSE)
> mats <- sort(unique(mats))
> ISIN <- names(bundData$cfList)

> ## set up cash flow matrix
> nR <- length(mats)
> nC <- length(cfList)
> cfMatrix <- array(0, dim = c(nR, nC))
> for (j in seq(nC))
      cfMatrix[mats %in% tmList[[j]], j] <- cfList[[j]]
> rownames(cfMatrix) <- mats
> colnames(cfMatrix) <- ISIN


The matrix cfMatrix should look as follows:


> head(cfMatrix[, 1:3], 5)

           DE0001135150 DE0001141471 DE0001135168
2010-06-20            0            0            0
2010-07-04          105            0            0
2010-09-20            0            0            0
2010-10-08            0          102            0
2010-10-10            0            0            0
cfMatrix corresponds to C'; see Eq. (16.6). Each column of cfMatrix stores the cash flows of
one bond; the rownames of cfMatrix indicate the time of payment. For a given set of NS/NSS
parameters, we compute a vector of zero rates y for the payment dates; then this vector can be
transformed into discount factors df with

> df <- 1 / ((1 + y)^tm)

(with y expressed as a decimal; the complete example divides rates quoted in percent by 100),
from which we can compute bond prices with

> b <- df %*% cfMatrix

In this way we compute all the bond prices in one step. Finally, we can compare these prices b with
the market prices bM. OF2 is an example of an objective function in which we compute the maximum
absolute difference between theoretical and observed prices. The complete example follows.
Note that the penalty function is the one we already used in the previous example.
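The construction of the cash flow matrix and the one-step pricing can be tried on a toy data set. The two bonds below, their dates, and the flat 2% zero curve are all invented for this sketch; only the mechanics mirror the bundData code.

```r
today <- as.Date("2010-05-31")

## two hypothetical bonds: cash flows and payment dates
cfList <- list(bondA = c(4.25, 104.25), bondB = 103)
tmList <- list(bondA = c("2010-10-12", "2011-10-12"),
               bondB = "2011-04-15")

## one row per distinct payment date, one column per bond
mats <- sort(unique(unlist(tmList, use.names = FALSE)))
cfMatrix <- array(0, dim = c(length(mats), length(cfList)),
                  dimnames = list(mats, names(cfList)))
for (j in seq_along(cfList))
    cfMatrix[mats %in% tmList[[j]], j] <- cfList[[j]]

tm <- as.numeric(as.Date(mats) - today) / 365
y  <- rep(2, length(tm))          # flat zero curve, in percent
df <- 1 / (1 + y/100)^tm

b <- df %*% cfMatrix              # all bond prices in one step

## the same prices, computed bond by bond
b_loop <- sapply(seq_along(cfList), function(j) {
    tj <- as.numeric(as.Date(tmList[[j]]) - today) / 365
    sum(cfList[[j]] / 1.02^tj)
})
```

The matrix is mostly zeros, which wastes some memory, but the single matrix product replaces a loop over bonds that would otherwise run in every evaluation of the objective function.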


> ## reprice bonds with known yield curve
> today <- as.Date("2010-05-31")
> tm <- as.numeric((as.Date(mats) - today))/365
> betaTRUE <- c(5, -2, 1, 10, 1, 5)
> yM <- NSS(betaTRUE, tm)
> df <- 1 / ((1 + yM/100)^tm)
> bM <- df %*% cfMatrix

> OF2 <- function(param, Data) {
      tm <- Data$tm
      bM <- Data$bM
      model <- Data$model
      cfMatrix <- Data$cfMatrix
      df <- 1 / ((1 + model(param, tm)/100)^tm)
      b <- df %*% cfMatrix
      aux <- max(abs(b - bM))
      if (is.na(aux))
          1e5
      else
          aux
  }

> ## collect all data in 'Data'
> Data <- list(
      bM = bM,
      tm = tm,
      cfMatrix = cfMatrix,
      model = NSS,
      pen.w = 1,
      min = c( 0, -15, -30, -30, 0, 3),
      max = c(15,  30,  30,  30, 3, 6))

> ## list of parameters for DEopt
> algo <- list(
      nP = 100,
      nG = 600,
      F = 0.5,
      CR = 0.99,
      min = c( 0, -15, -30, -30, 0, 3),
      max = c(15,  30,  30,  30, 3, 6),
      pen = penalty,
      repair = NULL,
      loopOF = TRUE,
      loopPen = FALSE,
      loopRepair = FALSE,
      printBar = FALSE)

> system.time(sol <- DEopt(OF = OF2, algo = algo,
                           Data = Data))

Differential Evolution.
Best solution has objective function value 1.25e-10 ;
standard deviation of OF in final population is 1.16e-10


user system elapsed 
1.19 0.00 1 AT 


> ## maximum yield error
> max(abs(Data$model(sol$xbest, tm) -
          Data$model(betaTRUE, tm)))

[1] 4.32e-10

> ## max. abs. price error and obj. function
> ## ==> should be the same
> df <- 1 / ((1 + NSS(sol$xbest, tm)/100)^tm)
> b <- df %*% cfMatrix
> max(abs(b - bM))

[1] 1.25e-10

> sol$OFvalue

[1] 1.25e-10


Let us compare the results with nlminb. See Fig. 16.11. 


Note that we have actually not used the bM that are contained in bundData, but have repriced
the bonds with a known term structure. This allows us to see if the algorithm works: we should be
able to achieve a pricing error of zero.


FIGURE 16.11 Results for Case 3, Nelson–Siegel–Svensson (NSS) model: True model yields, yield curves fitted with
DEopt, and yield curves fitted with nlminb. Plotted are the results of 5 restarts for both methods, though for DEopt it is
impossible to distinguish between the different curves.


We may sometimes find that there are still relatively large remaining errors in the short-term
zero rates. For instance, for the example given above, we can plot the errors in the interest rates and
prices with the following code example.


> do.call(par, par.nmof)
> par(mar = c(3, 3, 1, 1))

> ## plot rate error against ttm of payment
> plot(tm, NSS(sol$xbest, tm) - NSS(betaTRUE, tm),
       ylab = "Errors in yields",
       xlab = "Time to maturity")

[Figure: errors in zero rates plotted against time to maturity]

> do.call(par, par.nmof)
> par(mar = c(3, 3, 1, 1))

> ## plot price error against tm of bond
> plot(c((as.Date(sapply(tmList, max)) - today)/365),
       c(b - bM),
       ylab = "Errors in prices",
       xlab = "Time to maturity")

[Figure: errors in bond prices plotted against time to maturity]


There are two reasons for these larger errors. First, we did not ask the algorithm to minimize
differences in rates, but prices. And the price differences are numerically small: typically smaller
than 10⁻² in the given example. This is also economically small, given that we measure prices in
percentage points; the errors are smaller than a basis point.

There is a second reason. For short-dated bonds, any reasonable interest rate gives a good ap- 
proximation of the price. Short bonds have a low duration: changes in the interest rate have little 
impact on their prices. This argument in reverse implies that a small change in the bond price may 
bring about a large change in the interest rate. For our sample of bond prices, this is just an arti- 
fact; if we use the obtained yield curve to price the bonds, the computed prices will be fine. But if 
you need the rates to price other instruments, this may cause problems (or at least discomfort). To 
overcome this problem, we can weight the different bonds (e.g., by the inverse of time to maturity, 
or by inverse duration); or we may link yields from one day to the next, so as to avoid large jumps. 
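The first remedy, weighting the bonds, could be sketched as a small variation of the price-based objective function. The element w of Data is an assumption of this sketch (it is not part of the examples above), and the flat one-parameter model and the two toy bonds serve only to make the code runnable on its own.

```r
## weighted variant of a price-based objective function;
## Data$w is a hypothetical vector with one weight per bond
OF2_weighted <- function(param, Data) {
    df <- 1 / ((1 + Data$model(param, Data$tm)/100)^Data$tm)
    b <- df %*% Data$cfMatrix
    aux <- max(Data$w * abs(b - Data$bM))
    if (is.na(aux)) 1e5 else aux
}

## toy setting: flat yield curve, one short and one long bond
flat <- function(param, tm) rep(param[1], length(tm))
tm <- c(0.5, 1, 5)
cfMatrix <- cbind(short = c(102, 0, 0), long = c(0, 4, 104))
bM <- (1 / (1 + 3/100)^tm) %*% cfMatrix      # prices at a 3% curve

## weight each bond by the inverse of its time to maturity
maturity <- apply(cfMatrix != 0, 2, function(x) max(tm[x]))
Data <- list(tm = tm, cfMatrix = cfMatrix, bM = bM,
             model = flat, w = 1 / maturity)

at_truth <- OF2_weighted(3, Data)   # true parameter
off_by_1 <- OF2_weighted(4, Data)   # rates one point too high
```

With such weights, a given price error on a short bond counts for more than the same error on a long bond, which pushes the fit toward the short end of the curve. Whether inverse maturity or inverse duration is the better choice depends on the data.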


We can also calibrate our parameters to yields-to-maturity. This latter approach is used by several
central banks (e.g., Switzerland’s SNB; see Müller, 2002). In the next section we see how to solve
such a model.


Case 4: Fitting yields-to-maturity with Differential Evolution 


Aim: to find parameters for the NS/NSS models such that the yields-to-maturity r of theoretical
bond prices, computed from y(τ), are close to observed yields-to-maturity rᴹ.


Compared with the last example, we now need one further step to move from a given set of NS/NSS 
parameters to an objective function value: 


1: For given parameters, compute yield curve from NS/NSS model; 

2: with this yield curve, calculate theoretical bond prices; 

3: compute theoretical yields-to-maturity for theoretical bond prices; 

4: compute discrepancy between theoretical yields-to-maturity and observed yields-to-maturity. 


More formally, the problem now becomes 


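$$\min_{\beta,\lambda} \; \lVert r - r^{\mathrm{M}} \rVert$$

subject to

$$y(\tau) = \beta_1 + \beta_2 \left[\frac{1-\exp(-\tau/\lambda_1)}{\tau/\lambda_1}\right] + \cdots \qquad \text{(Eq. (16.8) for NS, Eq. (16.9) for NSS)}$$

$$\beta_1 > 0$$

$$\beta_1 + \beta_2 > 0$$

$$\lambda_i > 0 \quad (i = 1 \text{ for NS},\; i = 1, 2 \text{ for NSS}).$$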


The main downside of this approach is that it is more expensive than directly fitting the bond prices. 
In every iteration we now need to compute the internal rate of return (see page 488) for every bond. 
With 20 to 50 bonds, an optimization can then easily take more than a minute, compared with less 
than a minute before. 

We first write a function to compute yields. 


> compYield <- function(cf, tm, guess = 0.05) {
      fy <- function(ytm, cf, tm)
          sum(cf / ((1 + ytm)^tm))
      non.zero <- cf != 0
      cf <- cf[non.zero]
      tm <- tm[non.zero]
      ytm <- guess

      h <- 1e-8
      dF <- 1
      ci <- 0L

      while (abs(dF) > 1e-5) {
          ci <- ci + 1L
          if (ci > 5L)
              break
          FF <- fy(ytm, cf, tm)
          dFF <- (fy(ytm + h, cf, tm) - FF)/h
          dF <- FF/dFF
          ytm <- ytm - dF
      }
      ytm
  }


We use Newton’s method, as above; again we do not use analytic derivatives. We exit the
algorithm’s while structure if Newton has not converged after 5 steps. Converged means that the
difference between two consecutive guesses is smaller than 0.00001 (1e-5 in the code).
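When testing such a yield function, it can be reassuring to compare it with a general-purpose root finder. The following sketch uses base R's uniroot on the same present-value function; it is meant as a cross-check, not as a replacement inside the optimization, where the cheap five-step Newton iteration is the point. The bond is a hypothetical par bond, so the answer is known in advance.

```r
## present value of cash flows 'cf' paid at times 'tm' for yield 'ytm'
fy <- function(ytm, cf, tm)
    sum(cf / ((1 + ytm)^tm))

## a bond bought at 100 that pays a 4.25% coupon for three years:
## priced at par, so its yield must equal the coupon rate
cf <- c(-100, 4.25, 4.25, 104.25)
tm <- c(0, 1, 2, 3)

ytm <- uniroot(fy, interval = c(-0.5, 1),
               cf = cf, tm = tm, tol = 1e-10)$root
```

uniroot needs an interval on which the function changes sign, whereas Newton's method needs only a starting guess; this is one reason why a good guess matters so much for the approach in the text.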

The objective function: 


> OFyield <- function(param, Data) {
      tm <- Data$tm
      rM <- Data$rM
      model <- Data$model
      cfMatrix <- Data$cfMatrix
      nB <- dim(cfMatrix)[2L]
      zrates <- model(param, tm)
      aux <- 1e8
      df <- 1/((1 + zrates/100)^tm)
      b <- df %*% cfMatrix
      r <- numeric(nB)
      if (all(!is.na(b), df < 1, df > 0, b > 1)) {
          for (bb in seq_len(nB)) {
              if (bb == 1L)
                  guess <- 0.05
              else
                  guess <- r[bb - 1L]
              r[bb] <- compYield(c(-b[bb], cfMatrix[, bb]),
                                 c(0, tm), guess)
          }
          aux <- abs(r - rM)
          aux <- sum(aux)
      }
      aux
  }


Because Newton’s method is sensitive to “strange” parameter values, we have added several checks:
if a price is smaller than 1, for instance, we just set the objective function to a large number. Note
that the bonds in cfMatrix are sorted by time-to-maturity, so we would expect yields of adjacent
bonds to be similar. Thus, when we loop over the bonds, we choose the prior bond’s yield as the
initial guess for the current bond’s yield.

The following listing shows the complete example. 


> cfList <- bundData$cfList
> tmList <- bundData$tmList
> mats <- unlist(tmList, use.names = FALSE)
> mats <- sort(unique(mats))
> ISIN <- names(bundData[[1]])

> ## set up cash flow matrix
> nR <- length(mats)
> nC <- length(cfList)
> cfMatrix <- array(0, dim = c(nR, nC))
> for (j in seq(nC))
      cfMatrix[mats %in% tmList[[j]], j] <- cfList[[j]]
> rownames(cfMatrix) <- mats
> colnames(cfMatrix) <- ISIN

> ## compute artificial market prices
> today <- as.Date("2010-05-31")
> tm <- as.numeric((as.Date(mats) - today))/365
> betaTRUE <- c(5,-2,1,10,1,3); yM <- NSS(betaTRUE, tm)
> df <- 1 / ((1 + yM/100)^tm)
> bM <- df %*% cfMatrix
> rM <- apply(rbind(-bM, cfMatrix), 2, compYield, c(0, tm))

> ## collect all in dataList
> Data <- list(
      rM = rM,
      tm = tm,
      cfMatrix = cfMatrix,
      model = NSS,
      min = c( 0, -15, -30, -30, 0  , 2.5),
      max = c(15,  30,  30,  30, 2.5, 5),
      pen.w = 0.1
  )

> ## set parameters for de
> algo <- list(
      nP = 50L,
      nG = 500L,
      F = 0.50,
      CR = 0.99,
      min = c( 0, -15, -30, -30, 0  , 2.5),
      max = c(15,  30,  30,  30, 2.5, 5),
      pen = penalty, repair = NULL,
      loopOF = TRUE, loopPen = FALSE,
      loopRepair = FALSE,
      printDetail = TRUE,
      printBar = FALSE)

> system.time(sol <- DEopt(OF = OFyield,
                           algo = algo,
                           Data = Data))

Differential Evolution.
Best solution has objective function value 0.00127 ;
standard deviation of OF in final population is 8.53e-07

   user  system elapsed
    153     0.0    15.3

Again, we plot the results; see Fig. 16.12.


> ## maximum error DEopt
> max(abs(Data$model(sol$xbest, tm) - Data$model(betaTRUE, tm)))

[1] 0.00214

> ## maximum abs. yield error and objective function DEopt
> df <- 1 / ((1 + NSS(sol$xbest, tm)/100)^tm)
> b <- df %*% cfMatrix
> r <- apply(rbind(-b, cfMatrix), 2, compYield, c(0, tm))
> sum(abs(r - rM))

[1] 0.000189

> sol$OFvalue

[1] 0.000189


FIGURE 16.12 Results for Case 4, Nelson–Siegel–Svensson (NSS) model: True model yields, yield curve fitted with
DEopt, and yield curve fitted with nlminb. Plotted are the results of 5 restarts for both methods, though for DEopt it is
impossible to distinguish between the different curves.


> ## nlminb
> s0 <- algo$min + (algo$max - algo$min) *
      runif(length(algo$min))

> system.time(sol2 <- nlminb(s0, OFyield, Data = Data,
                             lower = algo$min,
                             upper = algo$max,
                             control = list(eval.max = 50000L,
                                            iter.max = 50000L)))

   user  system elapsed
  0.001   0.000   0.001


> ## maximum error nlminb
> max(abs(Data$model(sol2$par, tm) - Data$model(betaTRUE, tm)))


Pah 2253, 


> ## maximum abs. yield error and objective function nlminb
> df <- 1 / ((1 + NSS(sol2$par, tm)/100)^tm)
> b <- df %*% cfMatrix
> r <- apply(rbind(-b, cfMatrix), 2, compYield, c(0, tm))
> sum(abs(r - rM))

[1] 0.564

> sol2$objective

[1] 1e+08
16.1.4 Experiments


In the examples above, we always used artificial data. Because we knew the true parameters, we 
also knew how well or badly the algorithms worked. In this section we describe several experiments 
in such settings. These are not meant as once-and-for-all references, but they should give ideas on 
how to test implementations, and how to do diagnostic checking. The code for these examples is 
included either in this chapter’s R code file or in the NMOF manual (Schumann, 2011–2018a).


The objective function 


We fix parameters for the NS/NSS model and a vector of payment dates. From these, we compute
a vector yM. Then we run DEopt and try to find parameters such that max(abs(y - yM)) (or
some other function) is minimized. We know that ideally, the objective function should reach zero.
Specifically we set


> tm <- c(1, 3, 6, 9, 12, 15, 18, 21, 24, 30, 36,
          48, 60, 72, 84, 96, 108, 120)/12


and we draw parameters uniformly from the following ranges. (In this example, if a parameter 
combination leads to negative interest rates, we discard it and draw afresh.) 


parameter   minimum      maximum      parameter   minimum   maximum
β1          0.1          10           λ1          0         10
β2          −β1 + 0.1    10           λ2          0         10
β3          −10          10
β4          −10          10


We are not interested in parameter identification here, only in fitting the model yields to the true 
yields. We draw 1000 parameter sets. For each parameter vector, we compute the corresponding 
yM, and pass this vector to the DEopt function. Each time, we record the maximum absolute error 
max (abs (y - yM) ) (e.g., 0.2 means we have an error of 0.2 percentage points). 
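The drawing of parameters with rejection of negative rates can be sketched as follows. The function nss is a hand-written stand-in for the package's NSS function, draw_param is a name made up for this illustration, and the vector of maturities is shortened to keep the example small.

```r
## Nelson-Siegel-Svensson yields for one parameter vector
## (beta1, ..., beta4, lambda1, lambda2); stand-in for NMOF's NSS()
nss <- function(param, tm) {
    a1 <- tm / param[5]
    a2 <- tm / param[6]
    g1 <- (1 - exp(-a1)) / a1
    g2 <- (1 - exp(-a2)) / a2
    param[1] + param[2] * g1 +
        param[3] * (g1 - exp(-a1)) +
        param[4] * (g2 - exp(-a2))
}

## draw one parameter vector from the ranges in the table;
## discard it and draw afresh if any rate is negative
draw_param <- function(tm) {
    repeat {
        b1 <- runif(1, 0.1, 10)
        b2 <- runif(1, -b1 + 0.1, 10)   # ensures beta1 + beta2 >= 0.1
        b34 <- runif(2, -10, 10)
        lambda <- runif(2, 0, 10)
        param <- c(b1, b2, b34, lambda)
        if (all(nss(param, tm) >= 0))
            return(param)
    }
}

set.seed(42)
tm <- c(1, 3, 6, 12, 24, 60, 120)/12   # shortened set of maturities
params <- t(replicate(5, draw_param(tm)))   # one row per draw
```

Each row of params can then serve as the true parameter vector of one experiment: compute yM from it, run the optimizer, and record the maximum absolute error.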

As an example, we fix the parameters of DE as follows: F is 0.8, and CR is 0.8. We use a 
population size of 20, and run DE for 100 generations. (If you know DE, you will at once object 
to these settings; but they serve to make a point.) Since we have 1000 parameter sets, we get 
1000 solutions. Fig. 16.13, in its left panel, shows the distribution function of the absolute errors. 
We also add the distributions obtained after 500 generations, and after 1000. 

For only 100 generations, the results are bad. The median error is about 0.2% with a maximum 
error of sometimes a whole percentage point; essentially no run reaches a zero error. But for the 
runs with more generations, we get much better results. There are a few outliers, but apart from that 
the distribution really converges on zero. 


[Figure: two panels of empirical distribution functions; x-axis in both panels: maximum absolute error in %; legend in left panel: generations 100, 500, 1000]

FIGURE 16.13 Left: Convergence of solutions. Each distribution function shows the objective function value of the best 
population member from 1000 runs of DEopt. As the number of generations increases, the distributions become steeper 
and move to the left (zero is the optimal objective function value). Right: Convergence of population, i.e. objective function 
values across members at the end of DEopt run. The light-gray distributions are from runs with 100 generations; the darker 
(and steeper) ones come from 500 generations. Those latter distributions typically converge: all solutions in the population 
are the same. 
The solution from a DE run is the best member of the population. What about the other members?
In the right panel of the figure, we have plotted a first indication of why we obtained bad
results with 100 generations. The panel shows not solutions, but the final distribution of objective 
function values across the population. The light gray distributions are those from 100 generation 
runs; the darker ones (the steeper ones) come from 500 generations. After 500 generations, the 
population has converged much more. This is typical for DE: the members of the population all 
move to the best point found. In other words, if the objective function values across members are 
vastly different, this indicates too few generations. The following test can be used: do 100 runs of 
DE, and for each run extract not only the best member, but also the standard deviation (or range) of 
population members. Then look at the correlation between the objective function value of the best 
member and the standard deviation of the populations. If algo$printDetail is set to TRUE
(the default) for DEopt, the function will print the standard deviation of the final population’s
objective function values.
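The diagnostic just described can be tried without any package. The sketch below uses a deliberately small Differential Evolution routine, tiny_de, which is a made-up stand-in and not the implementation in DEopt; the quadratic test function and all settings are likewise assumptions chosen so that the runs are stopped too early.

```r
## minimal Differential Evolution (stand-in for DEopt); minimizes OF
## and returns the best value and the spread of the final population
tiny_de <- function(OF, nP, nG, F, CR, lower, upper) {
    d <- length(lower)
    mP <- lower + (upper - lower) * matrix(runif(d * nP), d, nP)
    vF <- apply(mP, 2, OF)
    for (g in seq_len(nG)) {
        vI <- sample.int(nP)
        R1 <- c(vI[nP], vI[-nP])            # shifted index vectors
        R2 <- c(R1[nP], R1[-nP])
        R3 <- c(R2[nP], R2[-nP])
        mPv <- mP[, R1] + F * (mP[, R2] - mP[, R3])
        keep <- matrix(runif(d * nP), d, nP) > CR  # TRUE: keep old element
        mPv[keep] <- mP[keep]
        vFv <- apply(mPv, 2, OF)
        better <- vFv < vF                  # accept only improvements
        mP[, better] <- mPv[, better]
        vF[better] <- vFv[better]
    }
    c(OFvalue = min(vF), popSD = sd(vF))
}

set.seed(1)
OF <- function(x) sum((x - 1)^2)    # test function with minimum zero
res <- t(replicate(100,
                   tiny_de(OF, nP = 10, nG = 20, F = 0.5, CR = 0.9,
                           lower = rep(-5, 4), upper = rep(5, 4))))

## relation between quality of best member and population spread
rho <- cor(res[, "OFvalue"], res[, "popSD"])
```

If the correlation is clearly positive, runs with widely dispersed final populations also tend to have worse best members, which is the sign of too few generations that the text describes.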


In the example here, with 1000 generations, the populations converge for all practical purposes.
We have about 90% of all solutions with an error smaller than one basis point. But importantly:
while non-convergence is often a sign that the number of generations is too small, convergence
does not guarantee that we have found the global optimum; in general, the population may just
have settled on a local minimum. Of course, in our setting we know that the true optimum is zero,
so we know when we have found the global optimum.

We could increase ng further, but there is often a better approach: we can try to find more- 
appropriate values of F and CR, and np. If we need to repeatedly solve a specific optimization 
problem, we should try to find good values for the parameters of a method. This is again an opti- 
mization problem, and not an easy one since the objective function is noisy. But we do not need the 
best possible parameters—remember that the robustness of the method with respect to parameter 
settings was one defining criterion for a heuristic. Rather, we can run experiments to find “good” 
parameter values. Instead of a (pointless) “F should be 0.62,” we look for answers like “F should 
be around 0.7-0.9.” 

The trouble is that to test all possibilities, the number of experiments we would have to run 
grows exponentially in the number of parameters of the heuristic. For DE, we need to set F and 
CR; it also matters how we distribute the function evaluations between np and ng. Practically, this 
is rarely a problem since, as said before, we do not really want to optimize, we only want to make 
sure we choose reasonable parameters. And in many applications, even inferior parameter settings 
can often be cured by just letting the algorithm search longer (i.e., by choosing a bigger population 
and more generations). The computing time lost is often only a fraction of the time one can spend 
on testing alternative settings. 

Let us run some experiments to find reasonable parameter values. We fix the number of 
function evaluations, and run the algorithm with different parameters. Possible values for CR are 
{0.3, 0.5, 0.7, 0.9}. Lower values, say 0.1, would not make much sense for this problem. Recall that 
in the NSS model, we have six parameters to estimate. With a CR of 0.1, the probability that a given 
solution stays unchanged would be $0.9^6 \approx 53\%$. F can be {0.1, 0.3, 0.5, 0.7, 0.9}. 
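
The arithmetic behind this number can be checked quickly. The sketch assumes, as the text does, that each of the six elements is changed independently with probability CR and that no element is forced to change; the variable names are ours.

```r
CR <- c(0.1, 0.3, 0.5, 0.7, 0.9)
d <- 6                          # number of NSS parameters
pUnchanged <- (1 - CR)^d        # probability that no element is changed
round(100 * pUnchanged, 1)      # in percent
```

Under these assumptions the probability drops quickly as CR grows, which is why only larger values of CR are tested.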

To save space, we do not plot the empirical distributions but use quartile plots. Quartile plots 
can be thought of as “reduced-form” boxplots (Tufte, 2001). They only print the median (the dot in 
the middle) and the whiskers of the boxplot. The following figure illustrates their construction. 

$Q_q$ stands for the qth quantile; the interquartile range IQR is $Q_{75} - Q_{25}$. As in the standard 
boxplot, the limits of the whiskers are $Q_{25} - 1.5 \times \mathrm{IQR}$ (or, if it is greater, the smallest observed 
value), and $Q_{75} + 1.5 \times \mathrm{IQR}$ (or, if it is smaller, the greatest observed value). Quartile plots are 
described in Tufte (2001, Chapter 6). The package NMOF contains the function qTable that helps 
to generate the table (but generally needs some hand-formatting). qTable uses the picture 
environment in LaTeX, so no further LaTeX packages are required. 
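
The numbers that a quartile plot displays can be computed in a few lines. The function quartileStats below is a hypothetical helper written for this illustration (it is not part of NMOF), and it relies on the default quantile definition of R's quantile function.

```r
## Hypothetical helper: the five numbers behind a quartile plot.
quartileStats <- function(x) {
    q <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE)
    iqr <- q[3] - q[1]
    c(lower  = max(q[1] - 1.5 * iqr, min(x)),   # lower whisker limit
      q25    = q[1],
      median = q[2],
      q75    = q[3],
      upper  = min(q[3] + 1.5 * iqr, max(x)))   # upper whisker limit
}

x <- c(1:9, 100)        # small sample with one extreme value
qs <- quartileStats(x)
qs
```

The extreme value 100 lies beyond the upper whisker limit, so the whisker ends at $Q_{75} + 1.5 \times \mathrm{IQR}$ and not at the largest observation.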
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Table 16.2 gives an impression of how the parameters affect the solution quality. The parameter 
F should be rather small, CR should be high. A good combination is a high CR (say, 0.9), and a 
small to medium F (say, 0.3-0.5). 

Another useful check is to look at the population over time. A solution returned from DEopt 
contains a matrix Fmat that stores the objective function values of all members of the population 
over all generations, thus the matrix is of size ng x np. It is useful to plot the results, for instance 
the average, best and worst solution over the course of the optimization. 


> plot(apply(sol$Fmat, 1, median), type = "l") 
> lines(apply(sol$Fmat, 1, min), lty = 3) 
> lines(apply(sol$Fmat, 1, max), lty = 3) 


Fig. 16.14 shows an example. The objective here was to minimize the absolute price difference 
between bonds; hence, the objective function is measured in percentage points. Both plots show 
the average (median), best, and worst solution for the same data set; but with different parameter 
settings. Left, we have F = 0.5, CR = 0.99; right, F = 0.9, CR = 0.5. We see that the parameter 
settings can have a large influence on the speed of convergence. 


Constraints 


We should also check if the penalties did their job. For a particular run, we can just look at the final 
solution and see if the returned parameters are feasible (or the penalty function evaluates to zero). 
But this is not much of a test: after all, we might have been lucky and the solution was within the 
constraints. 
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FIGURE 16.14 OF over time. The left panel shows F = 0.5, CR = 0.99. For the same problem, the right panel shows F = 
0.9, CR = 0.5. 


TABLE 16.2 Errors in % for NSS model for different parameters of DE. Each block corresponds to F ∈ {0.1, 0.3, 0.5, 0.7, 0.9} for a given CR. FE stands for objective 
function evaluations. Entries marked n/a could not be recovered. 

            1000 FE                10,000 FE              120,000 FE 
            (np = 20, ng = 50)     (np = 50, ng = 200)    (np = 200, ng = 600) 
CR    F     Med.   Min    Max      Med.   Min    Max      Med.   Min    Max 
0.3   0.1   0.30   0.02   1.81     0.07   0.01   1.01     0.02   0.00   0.47 
      0.3   0.35   0.02   1.81     0.09   0.01   0.93     0.03   0.00   0.50 
      0.5   0.41   0.03   1.87     0.11   0.00   0.81     0.03   0.00   0.46 
      0.7   0.48   0.04   1.86     0.13   0.01   1.15     0.04   0.00   0.51 
      0.9   0.58   0.04   n/a      0.16   0.01   0.85     0.05   0.01   0.48 
0.5   0.1   0.17   0.01   1.85     0.03   0.00   0.76     0.01   0.00   0.57 
      0.3   0.23   0.02   1.51     0.05   0.00   0.80     0.01   0.00   0.57 
      0.5   0.30   0.02   1.88     0.07   0.00   1.03     0.03   0.00   0.53 
      0.7   0.43   0.03   2.05     0.10   0.02   0.94     0.03   0.00   0.56 
      0.9   0.58   0.08   1.99     0.14   0.02   1.16     0.05   0.00   0.58 
0.7   0.1   0.10   0.01   1.87     0.02   0.00   n/a      0.00   0.00   0.57 
      0.3   0.11   0.01   1.69     n/a    n/a    n/a      n/a    n/a    n/a 
      0.5   0.20   0.02   1.54     n/a    n/a    n/a      n/a    n/a    n/a 
      0.7   0.35   0.04   1.74     n/a    n/a    n/a      n/a    n/a    n/a 
      0.9   0.59   0.05   n/a      n/a    n/a    n/a      n/a    n/a    n/a 
0.9   0.1   0.23   0.00   2.26     0.06   0.00   1.78     0.01   0.00   0.58 
      0.3   0.04   0.00   1.37     0.00   0.00   1.03     0.00   0.00   0.57 
      0.5   0.07   0.01   1.37     0.00   0.00   0.98     0.00   0.00   0.57 
      0.7   0.21   0.02   n/a      0.02   0.00   0.72     0.00   0.00   0.57 
      0.9   0.52   0.09   2.24     0.07   0.01   1.24     0.01   0.00   0.53 


FIGURE 16.15 The true value of $\beta_1$ is 5. Left: without constraint (penalty weight is zero). Right: with constraint $\beta_1 < 4$. 


A simple way to test a restriction is to set true parameters outside the feasible ranges, and then 
see if the algorithm halts at the parameter boundaries. This is advisable for any method we might 
use since it helps us to spot “unexpected behavior” (possibly caused by errors in the code). For 
penalties, it is important because it helps us to find sufficient weights for the penalties. 

An example: the true parameters for the NSS model are c(5, -2, 1, 10, 1, 5). We are 
mainly interested in $\beta_1$; the long-run rate is at 5%. Assume we have a very strong view that this 
rate should be positive but below 4%, so $4 > \beta_1 > 0$. We run DE once without constraints, and 
once with a penalty; the population size is 50 and we run for 200 generations. Fig. 16.15 shows on 
the y-axis the values of $\beta_1$ for the members of the population; the x-axis gives the generation (we 
included a bit of jittering). 

Adding the penalty clearly forces the parameter value inside the feasible boundaries. There is 
another remarkable observation: in both cases, we initialized the population within the feasible 
ranges, so initial $\beta_1$-values were all between 0 and 4. But in the unconstrained case, DE had little 
trouble moving the population outside this band and finding the correct values. 
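
A penalty of the kind used in this experiment could look as follows. This is a sketch, not the code behind Fig. 16.15: the function name penalty, its arguments, and the weight w = 100 are assumptions made for illustration.

```r
## Hypothetical penalty for the restriction 0 < beta1 < 4; the weight w
## is an assumption and would have to be found by experiments.
penalty <- function(param, lower = 0, upper = 4, w = 100) {
    b1 <- param[1]
    ## zero inside the band, growing linearly with the violation outside
    w * (max(lower - b1, 0) + max(b1 - upper, 0))
}

penTrue <- penalty(c(5, -2, 1, 10, 1, 5))   # true parameters violate the band
penFeas <- penalty(c(3, -2, 1, 10, 1, 5))   # feasible long-run rate
c(penTrue, penFeas)
```

The penalty would be added to the objective function value of each solution. If the weight is too small, the gain in fit from moving towards the true value of 5 may outweigh the penalty, which is exactly what the test with infeasible true parameters is meant to reveal.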


16.2 Robust and resistant regression 


Linear regression is a widely used tool in finance. It is, for instance, common practice to model asset 
returns as a linear combination of the returns of various types of factors. Such regressions can then 
be used to explain past returns, or in attempts to forecast future returns. In financial economics, 
such factor models are the primary tools in empirical asset pricing. Another area of application 
is risk management where the regression estimates can be used to construct variance—covariance 
matrices. There is considerable evidence of the usefulness of such models in this context (Chan et 
al., 1999). 

Regression models may not only be used to inform financial decisions by analyzing assets; we 
may also directly construct portfolios with them. For instance, one approach to replicate a portfolio 
or an index is to find investable assets whose returns “explain” the chosen regressand (e.g., the 
index return); see, for instance, Rudolf et al. (1999). We could also solve other portfolio problems 
with a regression: assume we have p assets, and let the symbol $x_i$ stand for the return of asset i at 
some point in time; we use $x_i^{\mathrm{e}}$ for the excess return over a constant risk-free rate. If a risk-free asset 
exists, mean-variance portfolio optimization reduces to finding the portfolio with the maximum 
ratio of excess return to portfolio volatility. This optimization problem can be rewritten as 

$$1 = \theta_1 x_1^{\mathrm{e}} + \theta_2 x_2^{\mathrm{e}} + \cdots + \theta_p x_p^{\mathrm{e}} + \epsilon$$

where $\theta_i$ are the coefficients to be estimated and $\epsilon$ holds the errors. Estimating the $\theta_i$ with Least 
Squares and rescaling them to conform with the budget constraint is equivalent to solving an 
unconstrained mean-variance problem for the tangency portfolio weights; see Britten-Jones (1999). 
The mathematics of the approach are outlined in an appendix to this chapter. The following R script 
illustrates the approach. 


> ## create artificial data ('daily returns')                  [tangency] 
> n <- 100   # number of observations 
> p <- 10    # number of assets 
> X <- array(rnorm(n * p, mean = 0.001, sd = 0.01), 
             dim = c(n, p)) 
> rf <- 0.0001            # riskfree rate (2.5% pa) 
> m <- apply(X, 2, mean)  # means 
> m2 <- m - rf            # excess means 
> ## (1) solve the problem with qp 
> library("quadprog") 
> aMat <- as.matrix(m2); bVec <- 1 
> zeros <- array(0, dim = c(p, 1)) 
> solQP <- solve.QP(cov(X), zeros, aMat, bVec, meq = 1) 
> # rescale variables to obtain weights 
> w <- solQP$solution/sum(solQP$solution) 
> # compute sharpe ratio 
> SR <- t(w) %*% m2 / sqrt(t(w) %*% cov(X) %*% w) 
> ## (2) solve with regression 
> X2 <- X - rf   # excess returns 
> ones <- array(1, dim = c(n, 1)) 
> # run regression 
> solR <- lm(ones ~ -1 + X2) 
> # rescale variables to obtain weights 
> w2 <- coef(solR) 
> w2 <- w2/sum(w2) 
> ## (3) solve first-order conditions 
> w3 <- solve(cov(X), m2) 
> # rescale 
> w3 <- w3/sum(w3) 
> ## check they are the same 
> all.equal(as.vector(w), as.vector(w2)) 

[1] TRUE 

> all.equal(as.vector(w), as.vector(w3)) 

[1] TRUE 

> all.equal(as.vector(w2), as.vector(w3)) 

[1] TRUE 


We can also find the global minimum-variance portfolio by running a regression (Kempf and 
Memmel, 2006). We write the portfolio return as the sum of its expectation $\mu$ and an error $\epsilon$; hence, 

$$\mu + \epsilon = \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_p x_p .$$

Imposing the budget constraint $\sum_i \theta_i = 1$ and rearranging we get 

$$x_p = \mu + \theta_1 (x_p - x_1) + \theta_2 (x_p - x_2) + \cdots + \theta_{p-1} (x_p - x_{p-1}) + \epsilon .$$

We can directly read off the portfolio weights from the regression; the weight of the pth position is 
determined via the budget constraint. 


> ## create artificial data with mean 0 and sd 5%             [minimum-var] 
> n <- 100   ## number of observations 
> p <- 10    ## number of assets 
> X <- array(rnorm(n * p, mean = 0, sd = 0.05), 
             dim = c(n, p)) 
> ## (1) solve with QP 
> library("quadprog") 
> aMat <- array(1, dim = c(1, p)) 
> bVec <- 1 
> zeros <- array(0, dim = c(p, 1)) 
> solQP <- solve.QP(cov(X), zeros, t(aMat), bVec, meq = 1) 
> ## ... and check solution 
> all.equal(as.numeric(var(X %*% solQP$solution)), 
            as.numeric(2 * solQP$value)) 

[1] TRUE 

> ## (2) regression 
> y <- X[, 1]               ## choose 1st asset as regressand 
> X2 <- X[, 1] - X[, 2:p]   ## differences with the other assets 
> solR <- lm(y ~ X2) 
> ## compare results of regression with qp 
> ## ... weights from qp 
> as.vector(solQP$solution) 

[1] 0.0835 0.1728 0.1132 0.1097 0.0997 0.1011 0.1084 
[8] 0.0462 0.0784 0.0869 

> ## ... weights from regression 
> as.vector(c(1 - sum(coef(solR)[-1]), coef(solR)[-1])) 

[1] 0.0835 0.1728 0.1132 0.1097 0.0997 0.1011 0.1084 
[8] 0.0462 0.0784 0.0869 

> ## variance of portfolio 
> all.equal(as.numeric(var(X %*% solQP$solution)), 
            var(solR$residuals)) 

[1] TRUE 

> ## (3) solve first-order conditions 
> x <- solve(cov(X), numeric(p) + 1)  ## or any other constant != 0 
> ## rescale 
> x <- x/sum(x) 
> ## compare results with QP solution 
> all.equal(solQP$solution, x) 

[1] TRUE 
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Running a portfolio optimization as a regression is an instructive exercise, but it is rarely the 
most convenient tool to compute portfolios. It becomes more difficult to include constraints, and the 
real problems (such as getting the data right) remain; see Chapter 14. The main insight comes from 
the fact that if a problem can be written as a regression, it will inherit the characteristics (in partic- 
ular, the weaknesses) of a regression. As an example, it is well known that with multi-collinearity 
it becomes difficult to identify regressors. Since a portfolio selection problem can be written as a 
regression, it will suffer from the same problem. With high correlation between the assets—which 
is the normal case—it becomes more difficult to “identify” (i.e. accurately compute) weights. 
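
A small simulation illustrates the point. It is a sketch under simple assumptions that are ours, not the book's: five assets with unit variance and a common pairwise correlation rho, samples of 100 observations, and minimum-variance weights computed from the sample covariance matrix.

```r
set.seed(1)
p <- 5      # number of assets
n <- 100    # observations per sample

## minimum-variance weights from the first-order conditions
minVarWeights <- function(X) {
    w <- solve(cov(X), rep(1, ncol(X)))
    w / sum(w)
}

## weights estimated on many samples with common correlation rho
simWeights <- function(rho, trials = 200) {
    Sigma <- matrix(rho, p, p)
    diag(Sigma) <- 1
    R <- chol(Sigma)                       # t(R) %*% R equals Sigma
    replicate(trials,
              minVarWeights(matrix(rnorm(n * p), n, p) %*% R))
}

## average standard deviation of the estimated weights
sdLow  <- mean(apply(simWeights(0.00), 1, sd))
sdHigh <- mean(apply(simWeights(0.95), 1, sd))
c(uncorrelated = sdLow, highly.correlated = sdHigh)
```

The true weights are 1/p in both settings, so any difference between the two numbers reflects only how precisely the weights can be estimated.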

Linear models are also used to evaluate the ex-post performance of investment managers. Since 
Sharpe (1992), so-called style analysis has become a building block in performance measurement 
and evaluation. The regression coefficients are then interpreted as portfolio weights and the residu- 
als as managerial skill (or luck). 

The standard method to obtain parameter estimates for a linear regression model is Least 
Squares (LS). LS has appealing theoretical and numerical properties, but the resulting estimates 
are often unstable if there exist extreme observations—and these are common in financial time se- 
ries (Chan and Lakonishok, 1992; Knez and Ready, 1997; Genton and Ronchetti, 2008). A few or 
even a single extreme data point can heavily influence the resulting estimates. Fig. 16.16 presents a 
concrete example of what can happen with data errors. A much-studied example is the estimation 
of $\beta$-coefficients for the CAPM, where small changes in the data (resulting, for instance, from a 
moving-window scheme) often lead to large changes in the estimated $\beta$-values. Earlier contributions 
in the finance literature suggested some form of shrinkage of extreme coefficients towards 
more reasonable levels, with different theoretical justifications (see, for example, Blume, 1971; 
Vasicek, 1973; Klemkosky and Martin, 1975). An alternative approach, which we will deal with 
in this chapter, is the application of robust or resistant estimation methods (Chan and Lakonishok, 
1992; Martin and Simin, 2003). 

To make the point clear: we do not suggest ignoring known errors in the data. The question is 
whether we always find the errors. 

There is, of course, a conceptual question: what exactly is an extreme observation or outlier in 
a financial time series? Extreme returns may occur rather regularly, and completely disregarding 
such returns by removing or winsorizing them could mean ignoring information. Errors in the 
data, though, for example stock splits that have not been accounted for, are clearly outliers (as in 
Fig. 16.16). Such data errors occur on a wide scale, even with commercial data providers (Ince and 
Porter, 2006). In particular if data are processed automatically, alternative techniques like robust 
estimation methods are advisable. 

In this section, we will discuss the application of robust estimators. Such estimators were spe- 
cially designed not to be influenced too heavily by outliers, even though this characteristic often 
comes at the price of low efficiency if the data actually contain no outliers. (Efficiency here means 
sampling efficiency, that is, an estimator is more efficient than another one if it has lower sampling 
variance.) Robust estimators are often characterized by their breakdown value. In words, the break- 
down point is the smallest percentage of contaminated (outlying) data that may cause the estimator 
to be affected by an arbitrary bias (Rousseeuw, 1997). While LS has a breakdown point of 0%, other 
estimators have breakdown points of up to 50%. Unfortunately, the estimation becomes much more 
difficult, and for many models only approximative solutions exist. We will describe the application 
of a heuristic, Particle Swarm Optimization, to such problems. 
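
That LS has a breakdown point of 0% is easy to demonstrate: a single contaminated observation suffices to move the estimates almost arbitrarily. The data in the following sketch are artificial and chosen only for illustration.

```r
set.seed(7)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 0.2)   # true slope is 0.5

slopeClean <- coef(lm(y ~ x))[2]

yBad <- y
yBad[20] <- 1000                         # one contaminated observation
slopeBad <- coef(lm(yBad ~ x))[2]

c(clean = unname(slopeClean), contaminated = unname(slopeBad))
```

Making the single bad value larger would move the estimated slope further still; there is no bound on the bias.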


16.2.1 The regression model 
We consider the linear regression model 

$$y = \begin{bmatrix} x_1 & \cdots & x_p \end{bmatrix} \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_p \end{bmatrix} + \epsilon .$$

Here, y is a vector of n observations of the dependent variable; there are p regressors whose 
observations are stored in the column vectors $x_j$. We will usually collect the regressors in a matrix 
$X = [x_1 \; \cdots \; x_p]$ and write $\theta$ for the vector of all coefficients. The jth coefficient is denoted $\theta_j$. 
We will normally include a constant as a regressor; hence, $x_1$ will be a vector of ones. 

(A) The adjusted price of Adidas according to finance.yahoo.com. 
FIGURE 16.16 The effects of a single outlier. On 6 June 2006, Adidas, a large German producer of clothing and shoes, 
made a stock split 4-for-1. The data was obtained from www.yahoo.com in April 2009. According to www.yahoo.com the 
series in the upper panel is split adjusted. Ironically, this is even true, but the price was adjusted twice, which resulted 
in the price jump in June 2006 (topmost panel). 

The residuals e (i.e., the estimates for the $\epsilon$) are computed as 

$$e = y - X \hat{\theta}$$

where $\hat{\theta}$ is an estimate for $\theta$. Least Squares (LS) requires minimizing the sum or, equivalently, the 
mean of the squared residuals; hence, the estimator is defined as 


$$\hat{\theta}_{\mathrm{LS}} = \operatorname*{argmin}_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} e_i^2 .$$

The advantage of this estimator is its computational tractability; the LS solution is found by solving 
the system of normal equations 

$$(X'X)\,\theta = X'y$$

for $\theta$.³ 
Rousseeuw (1984) suggested replacing the mean of the squared residuals with their median. The 
resulting Least Median of Squares (LMS) estimator can be shown to be less sensitive to outliers 
than LS; in fact, LMS’s breakdown point is almost 50%. More formally, LMS is defined as 

$$\hat{\theta}_{\mathrm{LMS}} = \operatorname*{argmin}_{\theta} \; \operatorname{median}(e^2) .$$


LMS can be generalized to the Least Quantile of Squares (LQS) estimator. Let $Q_q$ be the qth 
quantile of the squared residuals, that is 

$$Q_q = \mathrm{CDF}^{-1}(q) = \min\{ e_i^2 \mid \mathrm{CDF}(e_i^2) \ge q \}, \tag{16.11}$$

where q may range from 0% to 100% (we drop the percent sign in subscripts). Hence the LMS 
estimator becomes 

$$\hat{\theta}_{\mathrm{LMS}} = \operatorname*{argmin}_{\theta} \; Q_{50}(e^2) ,$$

and more generally we have 

$$\hat{\theta}_{\mathrm{LQS}} = \operatorname*{argmin}_{\theta} \; Q_q(e^2) .$$


For a given sample, several numbers satisfy definition (16.11); see 
Hyndman and Fan (1996). A convenient approach is to work directly with the order statistics 
$[e^2_{[1]} \; e^2_{[2]} \; \cdots \; e^2_{[n]}]'$. For LMS, for instance, the maximum breakdown point is achieved not by 
minimizing $Q_{50}(e^2)$, but by defining 

$$h = \left\lfloor \frac{n}{2} \right\rfloor + \left\lfloor \frac{p+1}{2} \right\rfloor \tag{16.12}$$

and minimizing $e^2_{[h]}$ (Rousseeuw, 1984, p. 873). 
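
The difference between the quantile and the order statistic can be made concrete in R. The following sketch uses random numbers as a stand-in for squared residuals; the choice of type = 1 in quantile (the inverse of the empirical distribution function) is ours and is only one of the definitions discussed by Hyndman and Fan (1996).

```r
set.seed(5)
n <- 100    # number of observations
p <- 10     # number of regressors
e2 <- rnorm(n)^2                         # stand-in for squared residuals

## median as the inverse of the empirical CDF
Q50 <- quantile(e2, 0.5, type = 1, names = FALSE)

## order statistic with maximum breakdown point
h <- floor(n / 2) + floor((p + 1) / 2)
eh <- sort(e2)[h]

c(h = h, Q50 = Q50, eh = eh)
```

With these values of n and p, the order statistic lies a few positions above the median of the squared residuals.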


The Least Trimmed Squares (LTS) estimator requires minimizing the order statistics of $e^2$ up 
to some maximum order k. Formally, 

$$\hat{\theta}_{\mathrm{LTS}} = \operatorname*{argmin}_{\theta} \; \frac{1}{k} \sum_{i=1}^{k} e^2_{[i]} .$$

To achieve a high breakdown value, the number k is set to roughly $\lfloor \frac{1}{2}(n + p + 1) \rfloor$, or the order 
statistic defined in Eq. (16.12). In practical applications (such as in Fig. 16.16) it is more common 
to use a value such as $\lfloor 0.9n \rfloor$. 

LQS and LTS estimators are sometimes called resistant estimators, since they do not just reduce 
the weighting of outlying points, but ignore them completely. This property in turn results in a low 
efficiency if there are no outliers. However, we can sometimes exploit this characteristic when we 
implement specific estimators. 
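
To see how the estimators differ, it helps to evaluate all four objective functions for one fixed coefficient vector. The function robustObjectives below is a hypothetical helper written for this illustration; h and k are the orders used for LQS and LTS.

```r
## Hypothetical helper: objective function values for a given theta.
robustObjectives <- function(theta, X, y, h, k) {
    e2 <- sort(as.vector(y - X %*% theta)^2)   # ordered squared residuals
    c(LS  = mean(e2),
      LMS = median(e2),
      LQS = e2[h],
      LTS = mean(e2[seq_len(k)]))
}

X <- cbind(1, 1:5)
y <- c(1, 2, 3, 4, 50)          # the last observation is an outlier
obj <- robustObjectives(c(0, 1), X, y, h = 3, k = 4)
obj
```

The line with intercept 0 and slope 1 fits the first four points exactly. The resistant criteria ignore the fifth point altogether, whereas the LS criterion is dominated by it.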


16.2.2 Estimation 


Robust estimation is computationally more difficult than LS estimation. A straightforward estima- 
tion strategy is to directly map the coefficients of a model into the objective function values, and 


3. More often we directly solve the Least Squares problem $X\theta = y$ without explicitly forming $X'X$; see Chapter 3. 


[PSopt.] 
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then to evolve the coefficients according to a given optimization method until a “good” solution is 
found. For LMS, for instance, we may start with a guess for the parameters $\theta$ and then change $\theta$ 
iteratively until the median squared residual cannot be reduced any further. The difficulty with this 
approach arises from the many local minima that the objective function exhibits. Heuristic meth- 
ods deploy specific strategies to overcome such local minima. We will show how to use Particle 
Swarm Optimization for this problem (and compare it with another heuristic, Differential Evolu- 
tion). 

Since many resistant estimators essentially fit models on only a subset of the data, we may also 
associate such subsets with particular objective function values—hence transform the estimation 
into a combinatorial problem. An intuitive example is the LTS estimator. Since the objective is to 
minimize the sum of the k smallest squared residuals, we could also, for every subset of size k, 
estimate LS coefficients. The subset with the minimum objective function value will give us the 
exact solution to the problem. But such a complete enumeration strategy is infeasible for even 
moderately sized data sets, so we need again a heuristic. The subset approach is discussed in Gilli 
and Schumann (2010f). 
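
The size of the combinatorial problem, and the simplest possible way around it, can be sketched as follows. The function ltsBySubsets is a hypothetical illustration that merely tries random subsets; it is not the method of Gilli and Schumann (2010f), and the data are artificial.

```r
## number of subsets of size 70 out of 100 observations
nSubsets <- choose(100, 70)

## Hypothetical sketch: LTS by fitting LS on random subsets of size k.
ltsBySubsets <- function(X, y, k, trials = 500) {
    best <- Inf
    thetaBest <- NULL
    for (i in seq_len(trials)) {
        s <- sample.int(nrow(X), k)
        theta <- qr.solve(X[s, , drop = FALSE], y[s])   # LS fit on the subset
        e2 <- sort(as.vector(y - X %*% theta)^2)
        f <- sum(e2[seq_len(k)])       # sum of the k smallest squared residuals
        if (f < best) {
            best <- f
            thetaBest <- theta
        }
    }
    list(theta = thetaBest, OFvalue = best)
}

set.seed(11)
n <- 40; p <- 3; k <- 30
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
y <- as.vector(X %*% c(1, 2, -1) + rnorm(n, sd = 0.1))
y[1:5] <- y[1:5] + 20              # five contaminated observations
sol <- ltsBySubsets(X, y, k)
sol$theta
```

Pure random sampling of subsets wastes most of its effort; a heuristic would instead modify promising subsets step by step.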


Implementing Particle Swarm Optimization 


Particle Swarm Optimization (PSO) was described in Chapter 12. For convenience, we repeat the 
pseudocode in Algorithm 64. An abbreviated implementation in R follows. 


> PSopt <- function(OF, algo = list(), ...) { 
      mRU <- function(m, n) 
          array(runif(m * n), dim = c(m, n)) 
      mRN <- function(m, n) 
          array(rnorm(m * n), dim = c(m, n)) 
      d <- length(algo$max) 
      vF <- numeric(algo$nP) 
      vF[] <- NA 
      mP <- algo$min + 
          diag(algo$max - algo$min) %*% mRU(d, algo$nP) 
      mV <- algo$initV * mRN(d, algo$nP) 
      for (s in 1:algo$nP) 
          vF[s] <- OF(mP[, s], ...) 
      mPbest <- mP 
      vFbest <- vF 
      sGbest <- min(vFbest) 
      sgbest <- which.min(vFbest)[1] 
      for (g in 1:algo$nG) { 
          mDV <- algo$c1 * mRU(d, algo$nP) * (mPbest - mP) + 
              algo$c2 * mRU(d, algo$nP) * (mPbest[, sgbest] - mP) 
          mV <- algo$iner * mV + mDV 
          logik <- mV > 0 
          mV[logik] <- pmin(mV, algo$maxV)[logik] 
          logik <- mV < 0 
          mV[logik] <- pmax(mV, -algo$maxV)[logik] 
          mP <- mP + mV 
          for (s in 1:algo$nP) 
              vF[s] <- OF(mP[, s], ...) 
          is.better <- vF < vFbest 
          mPbest[, is.better] <- mP[, is.better] 
          vFbest[is.better] <- vF[is.better] 
          if (min(vF) < sGbest) { 
              sGbest <- min(vF) 
              sgbest <- which.min(vF)[1] 
          } 
      } 
      list(vPar = mPbest[, sgbest], 
           OFvalue = sGbest, popF = vFbest) 
  } 


Algorithm 64 Particle Swarm Optimization for robust regression. 

 1: set $n_P$, $n_G$ and $c_1$, $c_2$ 
 2: randomly generate initial population $P_i^{(0)}$ and velocity $v_i^{(0)}$, $i = 1, \ldots, n_P$ 
 3: evaluate objective function $F_i = f(P_i^{(0)})$, $i = 1, \ldots, n_P$ 
 4: $Pbest = P^{(0)}$, $Fbest = F$, $Gbest = \min_i(F_i)$, $gbest = \operatorname{argmin}_i(F_i)$ 
 5: for $k = 1$ to $n_G$ do 
 6:     for $i = 1$ to $n_P$ do 
 7:         $\Delta v_i = c_1 u_1 (Pbest_i - P_i^{(k-1)}) + c_2 u_2 (Pbest_{gbest} - P_i^{(k-1)})$ 
 8:         $v_i^{(k)} = \delta v_i^{(k-1)} + \Delta v_i$ 
 9:         $P_i^{(k)} = P_i^{(k-1)} + v_i^{(k)}$ 
10:     end for 
11:     evaluate objective function $F_i = f(P_i^{(k)})$, $i = 1, \ldots, n_P$ 
12:     for $i = 1$ to $n_P$ do 
13:         if $F_i < Fbest_i$ then $Pbest_i = P_i^{(k)}$ and $Fbest_i = F_i$ 
14:         if $F_i < Gbest$ then $Gbest = F_i$ and $gbest = i$ 
15:     end for 
16: end for 
17: return best solution 


The PSO algorithm is implemented as a function that is called with three arguments: 

> PSopt(OF, algo = list(), ...) 

(Note that the implementation is very similar to that of Differential Evolution.) We will collect all 
objects to be passed with the ... in a list Data and always call 

> PSopt(OF, algo, Data) 

OF is the objective function. 

algo is a list that holds the settings of the PSO algorithm (discussed below). 

Data is a list that holds the pieces of data necessary to evaluate the objective function. For a 
regression model, Data will typically contain the matrix X and the vector y. 


The pseudocode for PSO shows two nested loops, but we can vectorize the inner loop. The 
variables P, Pbest, and v in the pseudocode are matrices; the iterator i in the inner loop in Al- 
gorithm 64 subsets the ith column of such a matrix, that is, it points to the ith solution. We can 
compute all instructions in one step by adding the weighted matrices. The variable gbest indicates 
the best solution of Pbest, so the vector $Pbest_{gbest}$ has to be expanded to make it compatible with 
$P^{(k-1)}$. We do not have to do this explicitly in R because we can use the so-called recycling rule; 
see Venables and Ripley (2000, p. 17). 


Example 16.1 Recycling vectors in R 


If A is a matrix and a is a scalar, then in linear algebra an operation like A + a is not defined. Rather, 
we need to multiply a by a matrix of ones that has the same size as A. But in MATLAB and R (and other 
vector-oriented languages), the command A + a is accepted and executed. In R, this idea of making 
matrices automatically compatible goes even further. Assume A is of size m x n and b is a vector of 
length m. Then R will consider A + b as equivalent to $A + b\iota'$, where $\iota$ is a vector of ones of length n. As 
an illustration, see the following code snippet. 


> m <- 3 
> n <- 7 
> A <- array(0, dim = c(m, n)) 
> b <- 1:m 
> A 

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] 
[1,]    0    0    0    0    0    0    0 
[2,]    0    0    0    0    0    0    0 
[3,]    0    0    0    0    0    0    0 

> A + b 

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] 
[1,]    1    1    1    1    1    1    1 
[2,]    2    2    2    2    2    2    2 
[3,]    3    3    3    3    3    3    3 


PSopt will return a list with several components, among them: 


xbest is the solution; the parameter vector with the lowest objective function value, 

OFvalue is the objective function value associated with xbest, 

popF is the vector with the final objective function values of the population members (i.e. 
OFvalue equals min(popF)), and 

Fmat is a matrix of size ng X np. It holds the objective function values of all solutions over 
time. 


Running PSO requires several ingredients: 


Initial population The initial solutions are drawn from a uniform distribution over given ranges. 
These ranges need to be specified through the vectors algo$min and algo$max. These 
ranges serve two purposes: they tell the algorithm how many elements a given solution 
has (length(algo$min)), and over what range to initialize the population. The 
ranges are not constraints. If we want to use them in restrictions, we should pass them as 
Data$min and Data$max. Not specifying algo$min and algo$max will produce an 
error. 


Initial velocity (initV) After we have initialized the population, we need to initialize the velocities. 
The parameter initV serves as a factor to randomly set the velocities (computed as 
initV × mRN(d, nP)). In many cases, this parameter can be set to 0. 

Weights $c_1$ and $c_2$ (c1 and c2) The weights of the moves toward a solution's own best position (c1) 
and toward the best position found by the whole population (c2). 

Inertia $\delta$ (iner) Inertia systematically reduces the velocity in each generation. 

Population size $n_P$ (nP) The number of solutions (population size). 

Stopping criterion We use a simple stopping criterion: a fixed number of generations, given by 
nG. Altogether, we will have nP × nG objective function evaluations. 


These parameters are collected in the list algo and then passed to the function PSopt. 

When we discussed Differential Evolution in Section 16.1, we explained that sometimes it is 
possible to evaluate the whole population at once, which is usually faster than looping over the 
single solutions. Let us give a concrete example for the objective function (the same holds for a 
penalty or repair function). Assume we have a regressor matrix X of size n x p, and a popu- 
lation matrix P of size p x np (each column of P is one solution). A straightforward computation 
of the residuals for each solution would then be the following loop: 
1: for $i = 1$ to $n_P$ do 
2:     compute residuals $y - X P_i$ 
3:     evaluate objective function f for residuals 
4: end for 


Alternatively, we can compute X P in one step, and so obtain a matrix of residuals of size n x np. 
See the next script. 


> ## set up matrix X with n rows and p columns                [vectorize] 
> n <- 100 
> p <- 5 
> X <- array(rnorm(n * p), dim = c(n, p)) 

> ## set up population P 
> nP <- 100   ## value not legible in the source; 100 is assumed 
> P <- array(rnorm(p * nP), dim = c(p, nP)) 

> loop <- function(X, P) { 
      ans <- array(0, dim = c(nrow(X), ncol(P))) 
      for (i in seq_len(ncol(P))) 
          ans[, i] <- X %*% P[, i] 
      ans 
  } 
> vect <- function(X, P) 
      X %*% P 
> library("rbenchmark") 
> benchmark(loop(X, P), 
            vect(X, P), 
            order = "relative", 
            replications = 1000)[, 1:4] 

        test replications elapsed relative 
2 vect(X, P)         1000   0.008     1.00 
1 loop(X, P)         1000   0.071     8.88 

> all.equal(loop(X, P), vect(X, P)) ## ... should be TRUE 
[1] TRUE 


Since we have an unconstrained problem, all we need is an objective function. For LQS, this 
could look like the following: 


> OF <- function(param, Data) {                               [of-lqs] 
      X <- Data$X 
      y <- Data$y 
      ## as.vector(y) for recycling; param is a matrix 
      aux <- as.vector(y) - X %*% param 
      aux <- aux * aux 
      aux <- apply(aux, 2, sort, partial = Data$h) 
      aux[Data$h, ] ## LQS 
  } 

The function takes two arguments: candidate solutions param, and the list Data. The latter contains 
three pieces of information: the actual data X and y, and the target residual h. For LTS the 
formulation is almost identical; we only need to change the last line in the code of the objective 
function. 

> colSums(aux[1:Data$h, , drop = FALSE]) ## LTS 


If param is a single solution, that is, a vector of length p, OF will return a vector of length one. But 
param can also be a matrix of size $p \times n_P$. In this case, the function will return a vector of length 
$n_P$. To see the difference in performance, we just need to set algo$loopOF to TRUE (see the code 
of the complete example). 

The most time-consuming part of the whole computation will be the sorting of the residuals. 
For both LQS and LTS, we do not need to sort all the residuals, but can make use of the partial 
argument of R’s sort function. 


> x <- rnorm(101)                                             [partial-sort] 
> xp <- sort(x, partial = 51) 
> do.call(par, par.nmof) 
> plot(xp, pch = 19, cex = 0.5) 
> middle <- xp[51] 
> abline(h = middle, v = 51) 

[Figure: xp plotted against Index (0 to 100).] 
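
That a partial sort is sufficient for both estimators can be verified directly. The sketch uses random numbers in place of squared residuals; the variable names are ours.

```r
set.seed(3)
e2 <- rnorm(1001)^2          # stand-in for squared residuals
h <- 700

full <- sort(e2)
part <- sort(e2, partial = h)

## LQS: the hth order statistic is in its correct place
sameLQS <- identical(part[h], full[h])

## LTS: the first h elements are the h smallest ones,
## though not necessarily in order
sameLTS <- isTRUE(all.equal(sum(part[1:h]), sum(full[1:h])))

c(sameLQS, sameLTS)
```

Only position h is guaranteed to hold its final value; the elements before it are smaller or equal but may be unordered, which is all that the two objective functions require.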


16.2.3 An example 


We will run an LQS regression as an example. As a benchmark we use the function lqs in the 
MASS package (Venables and Ripley, 2002). We start with creating some random data. 


> createData <- function(n, p,                                [createData] 
                         constant = TRUE, 
                         sigma = 2, 
                         oFrac = 0.1) { 
      X <- array(rnorm(n * p), dim = c(n, p)) 
      if (constant) 
          X[, 1] <- 1 
      b <- rnorm(p) 
      y <- X %*% b + rnorm(n) * 0.5 
      nO <- ceiling(oFrac * n) 
      when <- sample.int(n, nO) 
      X[when, -1] <- X[when, -1] + rnorm(nO, sd = sigma) 
      list(X = X, y = y) 
  } 
> n <- 100            ## number of observations 
> p <- 10             ## number of regressors 
> constant <- TRUE    ## include constant in model? 
> sigma <- 5          ## sd of outliers 
> oFrac <- 0.15       ## fraction of outliers in data 
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We put X and y into Data, and also the residual h that we wish to minimize. 


> h <- 70 ## ...0r use something like floor((n+1)/2) 


The settings for PSO and DE. 


> tmp <- createData(n, p, constant, sigma, oFrac) 
> X <- tmpsx 
> y = GNOE 
> Dewees <- liste = y, 
I S Wp 
iny = Iau) 


> popsize <- 100 
generations <- 400 
= jos <- listio = ras(>10, 19) 4 
mecs = zaa CA, joy), 
Gil = Oy, 
CF = 2.0, 
waer = (0) 413}, 
ianrey = OO 
nP = popsize, 
nG = generations, 
maxV = 3, 
loopOF = FALSE, 
printBar = FALSE) 
> de <- list(min = rep(-10, p), 
max = rep( 10, p), 
nP = popsize, 
nG = generations, 
Ip = (OA, 
GR = 0.5, 
loopOF = FALSE, 
printBar = FALSE) 


Finally, we can run the three methods. Note that we have set the number nsamp of samples in lqs 
relatively high so that the functions have roughly similar computing time. 

> system.time(solPS <- PSopt(OF, ps, Data)) 

Particle Swarm Optimisation. 
Best solution has objective function value 0.275 ; 
standard deviation of OF in final population is 1.32 

   user  system elapsed 
  0.636   0.000   0.637 

> system.time(solDE <- DEopt(OF, de, Data)) 

Differential Evolution. 
Best solution has objective function value 0.256 ; 
standard deviation of OF in final population is 0.00723 

   user  system elapsed 
  0.613   0.000   0.613 

> library("MASS") 
> system.time(test1 <- lqs(y ~ X[, -1], 
                           adjust = TRUE, 
                           nsamp = 100000, 
                           method = "lqs", 
                           quantile = h)) 

   user  system elapsed 
  0.585   0.000   0.585 

> res1 <- sort((y - X %*% as.matrix(coef(test1)))^2)[h] 
> res2 <- sort((y - X %*% as.matrix(solPS$xbest))^2)[h] 
> res3 <- sort((y - X %*% as.matrix(solDE$xbest))^2)[h] 
> cat("lqs:   ", res1, "\n", 
      "PSopt: ", res2, "\n", 
      "DEopt: ", res3, "\n", sep = "") 

lqs:   0.352 
PSopt: 0.275 
DEopt: 0.256 


There are two points that we would like to emphasize. First, many algorithms that have been 
suggested for robust estimators are based on sampling; hence, these algorithms also are stochastic. 
The algorithm behind lqs is a case in point. Thus, the results obtained with such algorithms should 
be analyzed just like heuristics, as discussed in Chapter 12. (See also Zeileis and Kleiber, 2009.) 
When we change the number of samples for lqs, we can improve the solution. Thus, we also have 
a trade-off between solution quality and computing time. The following code provides an example. 
Results are shown in Fig. 16.17. 


> ## n <- 100 ## number of observations 
> ## p <- 10 ## number of regressors 
> ## constant <- TRUE; sigma <- 5; oFrac <- 0.15 
> ## h <- 70 ## ...or use something like floor((n+1)/2) 
> ## aux <- createData(n,p,constant,sigma,oFrac) 
> ## X <- aux$X 
> ## y <- aux$y 

> trials <- 100 
> res1 <- numeric(trials) 
> for (t in seq_len(trials)) { 
      modl <- lqs(y ~ X[, -1], 
                  adjust = TRUE, 
                  nsamp = "best", 
                  method = "lqs", 
                  quantile = h) 
      res1[t] <- sort((y - X %*% as.matrix(coef(modl)))^2)[h] 
  } 
> res2 <- numeric(trials) 
> for (t in seq_len(trials)) { 
      modl <- lqs(y ~ X[, -1], 
                  adjust = TRUE, 
                  nsamp = 10000, 
                  method = "lqs", 
                  quantile = h) 
      res2[t] <- sort((y - X %*% as.matrix(coef(modl)))^2)[h] 
  } 
> res3 <- numeric(trials) 
> for (t in seq_len(trials)) { 
      modl <- lqs(y ~ X[, -1], 
                  adjust = TRUE, 
                  nsamp = 100000, 
                  method = "lqs", 
                  quantile = h) 
      res3[t] <- sort((y - X %*% as.matrix(coef(modl)))^2)[h] 
  } 
> summary(cbind(default = res1, 
                nsamp.10k = res2, 
                nsmap.100k = res3)) 

    default         nsamp.10k       nsmap.100k 
 Min.   :0.331   Min.   :0.266   Min.   :0.264 
 1st Qu.:0.377   1st Qu.:0.356   1st Qu.:0.315 
 Median :0.398   Median :0.372   Median :0.331 
 Mean   :0.398   Mean   :0.369   Mean   :0.327 
 3rd Qu.:0.418   3rd Qu.:0.384   3rd Qu.:0.338 
 Max.   :0.467   Max.   :0.427   Max.   :0.360 

FIGURE 16.17 Objective function values obtained with lqs with increasing number of samples (see code example). 


Second, in line with the theme of the book, before you let PSO and DE run with ever more 
generations, think about why you use LMS or another estimator. LMS is useful to find data errors 
(recall Fig. 16.16). But for that, we often do not need a very precise solution. In particular, for LTS 
there exists a simple algorithm called fastLTS that is extremely powerful in practice (Rousseeuw 
and Van Driessen, 2005). So why use heuristics? The advantage of heuristics comes when we want 
to change the standard setup, for example, when we have constraints. And though we have 
chosen a linear model, the heuristics could just as well be used with nonlinear models: we just 
need to write another objective function. 
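
To make the last point concrete, here is a minimal sketch of an LQS-type objective function for a nonlinear model. The model y = a exp(bx) + e, the name OF_nonlinear, and the simulated data are assumptions chosen for illustration; they do not appear in the example above.

```r
## Hypothetical LQS-type objective for the nonlinear model y = a * exp(b * x) + e.
## param = c(a, b); Data is a list with x, y, and the order statistic h.
OF_nonlinear <- function(param, Data) {
    fitted <- param[1] * exp(param[2] * Data$x)
    res2 <- (Data$y - fitted)^2
    sort(res2, partial = Data$h)[Data$h]   # h-th smallest squared residual
}

set.seed(7)
x <- runif(100)
y <- 2 * exp(0.5 * x) + rnorm(100, sd = 0.1)
y[1:10] <- y[1:10] + 10                    # contaminate ten observations
Data <- list(x = x, y = y, h = 70)

OF_nonlinear(c(2, 0.5), Data)   # parameters used to create the data
OF_nonlinear(c(0, 0), Data)     # a poor candidate gives a larger value
```

Only the line that computes the fitted values differs from the linear case; a function with this signature can be handed to a heuristic in the same way as the linear objective function.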


16.2.4 Numerical experiments 


As we discussed in Chapter 12, almost all heuristics are stochastic algorithms. PSO is no exception, 
so restarting the algorithm several times for the same data set will result in different solutions. 
We characterize a solution $\theta$ by its associated objective function value, and treat such a solution 
obtained from one optimization run as the realization of a random variable with an unknown 
distribution D. For a given data set and a model to estimate (LMS in our case), the shape of D will 
depend on the particular optimization technique and its parameter settings, and, in particular, on the 
amount of computational resources spent on an optimization run. Heuristic methods are specially 
designed so that they can move away from local optima, so if we allow more iterations, we would 
expect the method to produce better results on average. In fact, for an ever increasing number of 
iterations, we would finally expect D to degenerate to a single point, the global minimum. In practice, 
we cannot let an algorithm run forever; hence we are interested in the convergence of specific 
algorithms for finite amounts of computational resources. “Convergence” thus means the change 
in the shape of D when we increase the number of iterations. To analyze D, we fix the settings for 
a method (data, parameters, numbers of iterations) and repeatedly restart the algorithm. Thus we 
obtain a sample of draws from D, from which we can compute an empirical distribution function 
as an estimate for D. 
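
The procedure can be sketched in a few lines. The optimizer below is a pure random search on a toy objective function, used only as a stand-in for a heuristic; the function names, the budgets of 50 and 500 evaluations, and the number of restarts are hypothetical choices.

```r
## Toy stochastic optimizer: pure random search (an illustrative stand-in).
random_search <- function(f, n_eval, lower, upper) {
    best <- Inf
    for (i in seq_len(n_eval)) {
        cand <- runif(length(lower), lower, upper)
        val <- f(cand)
        if (val < best)
            best <- val
    }
    best                                 # objective function value of best solution
}

f <- function(x) sum(x^2)                # toy objective with global minimum 0
set.seed(123)
restarts <- 200
few  <- replicate(restarts, random_search(f,  50, c(-1, -1), c(1, 1)))
many <- replicate(restarts, random_search(f, 500, c(-1, -1), c(1, 1)))

D_few  <- ecdf(few)                      # empirical estimates of D
D_many <- ecdf(many)
c(median(few), median(many))             # more evaluations shift D to the left
```

Plotting D_few and D_many in one graph shows how the distribution moves toward the minimum and becomes steeper as the budget grows.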

In the following section, we will describe experiments conducted in Gilli and Schumann 
(2010f). We will compare two different heuristics, DE and PSO, for LMS regression. We define 
computational resources as the number of objective function evaluations. For DE and PSO, this is 
equal to the number of generations times the population size. This is justified for LMS regression 
since the overhead incurred from evolving solutions is small compared with the time necessary to 
compute the median of the squared residuals (which requires at least a partial sorting of the squared 
residuals). Fixing the number of function evaluations has the advantage of allowing us to compare 
the performance of different methods for a given amount of computational resources. 

We use the experimental setting described in Salibian-Barrera and Yohai (2006). They consider 
the regression model 


$$y = X\theta + \epsilon, \tag{16.13}$$ 


where $X$ is of size $n \times p$, $\theta$ is the $p$-vector of coefficients, and $\epsilon$ is Gaussian noise, that is, 
$\epsilon \sim N(0, 1)$. We always include a constant, so the first column of $X$ is a vector of ones. The 
remaining elements of $X$ and $y$ are normally distributed with a mean of zero and a variance of one. 
Accordingly, the true $\theta$-values are all zero, and the estimated values should be close to zero. We 
replace, however, about 10% of the observations with outliers. More precisely, if a row in [y X] is 
contaminated with an outlier, it is replaced by 


[M 1 100 0 ... 0] 


where M is a value between 90 and 200. This setting results in a region of local minima in the 
search space where $\theta_2$ will be approximately M/100. The function genData can be used to create 
data sets. 


> genData <- function(nP, nO, ol, dy) { 
      ## create data as in Salibian-Barrera & Yohai 2006 
      ## nP .. regressors 
      ## nO .. number of obs 
      ## ol .. number of outliers 
      ## dy .. outlier size ('M' in S-B&Y 2006): 90 to 200 
      mRN <- function(m, n) 
          array(rnorm(m * n), dim = c(m, n)) 
      y <- mRN(nO, 1) 
      x <- cbind(as.matrix(numeric(nO) + 1), 
                 mRN(nO, nP - 1)) 
      zz <- sample(nO) 
      z <- cbind(1, 100, 
                 array(0, dim = c(1, nP - 2))) 
      for (i in 1:ol) { 
          x[zz[i], ] <- z 
          y[zz[i]] <- dy 
      } 
      list(X = x, y = y) 
  } 

Results 


All the methods employed require us to set a number of parameters. When we start with some 
problem, we will normally use “typical” parameter values. But all our results are conditional on 
the chosen values for the method’s parameters. An important part of implementing heuristics is, 
hence, the tuning of the algorithm, that is, finding good parameter values. This search is again an 
optimization problem: find those parameter values that lead to optimal (or, following this book’s 
theme, good enough) results in every restart. In other words, we want parameter values that lead to 
a “good” D. Since all methods need several parameters to be set, this optimization problem is not 
trivial; in particular, since the objective function has to be evaluated from simulation and thus will 
be noisy. Though this problem can be handled as an optimization problem, for our purposes here 
we do not need such an optimization; in fact, we are after the opposite. Parameter setting 
is sometimes portrayed as an advantage, for it allows us to adapt methods to different problems. 
True. But at the same time it requires the person who wishes to apply the method to have a much 
deeper understanding of the respective method. In other words, the user will have to be a specialist 
in optimization, rather than in finance or econometrics. So what we would be much more interested 
in is to see if a method completely breaks down for certain parameters. 

To illustrate this point, we look at the model with p = 20 regressors, and solve it with different 
settings for the parameters. We set M to 150, and the number of observations n is fixed at 400. 
The number of function evaluations was set to 30,000. For every parameter setting we conducted 
1000 restarts. The results shown below are based on the same instance of the problem; thus, the 
results in the following tables are directly comparable for different methods. 

Table 16.3 shows the results when we vary F and CR. We include the median, best, and worst 
value of the obtained solutions. Furthermore we include quartile plots (Tufte, 2001) of the distribu- 
tions (see page 519); see function qTable in the NMOF package. 

The solutions returned by DE improve drastically when we set F to untypically low values 
while different choices for CR have less influence. With small F, we evolve the solutions by adding 
small changes at several dimensions of the solution. In a sense, then, we have a population of local 
searches, or at least of slowly moving individuals. 
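
To see where F and CR enter, consider the following sketch of one generation of a basic DE variant. It is a simplified illustration, not the implementation used for the experiments: the function de_generation, the test function sphere, and the population size are assumptions.

```r
## One generation of a basic DE variant (a sketch; one solution per column of P).
de_generation <- function(P, f, F, CR) {
    d <- nrow(P); nP <- ncol(P)
    fP <- apply(P, 2, f)
    Pnew <- P
    for (i in seq_len(nP)) {
        r <- sample(setdiff(seq_len(nP), i), 3)       # three other members
        mutant <- P[, r[1]] + F * (P[, r[2]] - P[, r[3]])
        cross <- runif(d) <= CR                       # elements taken from mutant
        trial <- ifelse(cross, mutant, P[, i])
        if (f(trial) <= fP[i])                        # keep trial only if not worse
            Pnew[, i] <- trial
    }
    Pnew
}

sphere <- function(x) sum(x^2)                        # toy objective function
set.seed(5)
P0 <- matrix(runif(20 * 30, -10, 10), nrow = 20)      # 20 dimensions, 30 members
P1 <- de_generation(P0, sphere, F = 0.2, CR = 0.8)
all(apply(P1, 2, sphere) <= apply(P0, 2, sphere))     # no member gets worse
```

F scales the difference vector that is added to a member, so a small F means small steps; CR controls in how many dimensions such a step is taken.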

TABLE 16.3 Parameter sensitivity DE. 
[Quartile plots omitted.] 

CR    F     median   best    worst 
0.2   0.2   0.47     0.424   0.507 
      0.4   0.53     0.464   0.575 
      0.6   0.75     0.56    0.962 
      0.8   1.54     0.988   2.08 
0.4   0.2   0.44     0.399   0.472 
      0.4   0.49     0.437   0.558 
      0.6   0.91     0.631   [illegible] 
      0.8   2.81     1.66    4.03 
0.6   0.2   0.41     0.356   0.443 
      0.4   0.48     0.41    0.512 
      0.6   1.19     0.848   1.88 
      0.8   5.36     2.35    7.73 
0.8   0.2   0.38     0.338   0.432 
      0.4   0.48     0.409   0.523 
      0.6   2.29     1.20    3.64 
      0.8   9.05     3.36    [illegible] 

Tables 16.4–16.7 give the results for PSO; here the picture is different. While there are differences 
in the results for different settings of the parameters, the results are quite stable when we vary 
$\delta$, $c_1$, and $c_2$. Each table gives results for different values of $c_1$ and $c_2$, with $\delta$ fixed for the whole 
table. The most salient result is that velocity should not be reduced too fast; hence, $\delta$ should be 
below but close to one. 

TABLE 16.4 Parameter sensitivity PSO for $\delta = 1$. 
[Quartile plots omitted.] 

c2    c1    median   best    worst 
0.5   0.5   0.46     0.384   0.921 
      1.0   0.45     0.376   0.944 
      1.5   0.45     0.394   0.985 
      2.0   0.45     0.399   0.938 
1.0   0.5   0.47     0.404   0.872 
      1.0   0.46     0.391   0.91 
      1.5   0.45     0.371   0.936 
      2.0   0.45     0.402   1.03 
1.5   0.5   0.46     0.406   0.96 
      1.0   0.46     0.395   0.89 
      1.5   0.45     0.399   0.926 
      2.0   0.45     0.402   0.829 
2.0   0.5   0.46     0.402   [illegible] 
      1.0   0.46     0.39    1.01 
      1.5   0.45     0.401   0.85 
      2.0   0.45     0.392   0.833 

TABLE 16.5 Parameter sensitivity PSO for $\delta = 0.5$. 
[Quartile plots omitted.] 

c2    c1    median   best    worst 
0.5   0.5   0.61     0.416   1.23 
      1.0   0.59     0.409   1.01 
      1.5   0.59     0.419   0.935 
      2.0   0.58     0.401   0.962 
1.0   0.5   0.57     0.385   1.09 
      1.0   0.55     0.372   1.04 
      1.5   0.54     0.366   0.854 
      2.0   0.52     0.343   0.89 
1.5   0.5   0.53     0.353   1.03 
      1.0   0.53     0.361   1.05 
      1.5   0.50     0.36    0.924 
      2.0   0.48     0.339   1.07 
2.0   0.5   0.50     0.348   0.933 
      1.0   0.49     0.337   0.90 
      1.5   0.46     0.331   0.867 
      2.0   0.44     0.33    0.835 
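
The role of the inertia weight can be seen in a sketch of the velocity update. This is a simplified illustration of a common PSO update rule, not the code behind the tables; the function names and the argument names iner, c1, c2, and maxV are assumptions.

```r
## One PSO update (a sketch; one particle per column of P).
pso_update <- function(P, V, Pbest, gbest,
                       iner = 0.9, c1 = 1, c2 = 1, maxV = 3) {
    d <- nrow(P); nP <- ncol(P)
    V <- iner * V +                                         # inertia
         c1 * matrix(runif(d * nP), d, nP) * (Pbest - P) +  # pull to own best
         c2 * matrix(runif(d * nP), d, nP) * (gbest - P)    # pull to overall best
    V <- pmin(pmax(V, -maxV), maxV)                         # limit the velocity
    list(P = P + V, V = V)
}

## With c1 = c2 = 0, only inertia acts: velocity shrinks by iner in every step.
## (The positions are not updated here; we only track the velocity.)
decay <- function(iner, k = 10) {
    P <- matrix(0, 2, 1); V <- matrix(1, 2, 1)
    for (i in seq_len(k))
        V <- pso_update(P, V, P, P[, 1], iner = iner, c1 = 0, c2 = 0)$V
    V[1, 1]
}
c(decay(0.5), decay(0.9))   # remaining velocity after 10 steps: iner^10
```

With a weight of 0.5, almost nothing of the initial velocity is left after ten iterations, so the swarm loses its ability to explore early on.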

In sum, for this problem, PSO worked robustly for all kinds of parameter settings. DE on the 
other hand worked fine only if properly tuned. In particular, for typical values of F (in the range 
of 0.7-0.9, say) the method did not return good results. Rather we had to use small values, and 
ideally also for CR. 


4. See, however, Maringer (2008a) where a calibration study with a different test problem is reported and where typical 
values for F worked well. It is possible that the parameter F needs to be linked to the dimensionality of the problem, akin to 
the situation for the Metropolis algorithm; see footnote 6 on page 146. 

TABLE 16.6 Parameter sensitivity PSO for $\delta = 0.75$. 
[Quartile plots omitted.] 

c2    c1    median   best    worst 
0.5   0.5   0.47     0.348   0.89 
      1.0   0.46     0.339   0.923 
      1.5   0.45     0.339   0.797 
      2.0   0.43     0.327   0.806 
1.0   0.5   0.46     0.333   0.881 
      1.0   0.44     0.324   0.822 
      1.5   0.43     0.326   0.81 
      2.0   0.41     0.327   0.80 
1.5   0.5   0.43     0.328   0.834 
      1.0   0.43     0.316   0.818 
      1.5   0.42     0.316   0.84 
      2.0   0.42     0.338   0.847 
2.0   0.5   0.42     0.332   0.818 
      1.0   0.42     0.337   0.878 
      1.5   0.43     0.327   0.774 
      2.0   0.44     0.358   0.873 

TABLE 16.7 Parameter sensitivity PSO for $\delta = 0.9$. 
[Quartile plots omitted.] 

c2    c1    median   best    worst 
0.5   0.5   0.41     0.330   0.879 
      1.0   0.41     0.328   0.82 
      1.5   0.41     0.335   0.776 
      2.0   0.42     0.348   0.766 
1.0   0.5   0.42     0.335   0.913 
      1.0   0.42     0.332   0.884 
      1.5   0.42     0.356   0.845 
      2.0   0.43     0.365   0.758 
1.5   0.5   0.44     0.366   0.882 
      1.0   0.44     0.361   0.83 
      1.5   0.44     0.367   0.781 
      2.0   0.44     0.377   0.832 
2.0   0.5   0.45     0.375   0.79 
      1.0   0.45     0.386   0.858 
      1.5   0.44     0.38    0.922 
      2.0   0.44     0.364   0.891 


What is a good solution? 

In their paper, Salibian-Barrera and Yohai (2006) analyze how often a given estimator converges 
to the wrong solution, that is, a biased $\theta_2$. Such an analysis, however, confounds two issues: the 
ability of a given estimator to identify the outliers on the one hand, and the numerical optimization 
on the other. Since we can only control the optimization, we have not compared coefficients, but 
have looked at the value of the objective function. 

Let us dwell on this point for a moment. The aim of robust regression methods—to be outlier 
resistant—cannot be directly pursued: we cannot maximize the probability of identifying outliers, 
or minimize the probability of converging on a biased parameter value. Instead, estimators are 
evaluated theoretically by general characteristics such as the breakdown value (Rousseeuw, 1997) 
or the influence function (Hampel et al., 1986). Such properties are almost always of an asymptotic 
nature and offer little guarantee in finite samples. Hence, when we compare different estimators 
with one another in applications, there are two aspects that we should investigate: (i) what is the 
quality of the estimator when it comes to its actually desired function, namely to spot outliers or 
to converge to correct parameter values, and (ii) what is the quality of the optimization, and how is 
finding a good numerical solution related to question (i)? 

These two notions of a good solution are not always distinguished in the literature. Rousseeuw 
and Van Driessen (2005, p. 35), for instance, refer to the “global optimum” as a subset of data 
that contains no outliers. In optimization, a global optimum would be the parameter values that 
give the objective function its minimum value (beyond Monte Carlo settings, Rousseeuw and Van 
Driessen’s definition could never be checked). If an estimator fails to identify outliers, we need to 
investigate whether it is not just due to a poor optimization. On the other hand, even if the estimator 
does as expected (which, of course, is only testable in experiments), we need to know how relevant 
a good fit is. Assume our criterion of fit is a function (a norm) of the residual vector, and our aim is 
to identify outliers in a sample. Then ideally we would like to have a monotonic, or “threshold,” 
behavior in the optimization: once we have found a solution better than a certain threshold, the 
model always identifies the outlying points. Unfortunately, it is not clear whether this is the case 
for specific estimators; here is an example in which we do not have threshold behavior. 

We stay with the data-generating process of Salibian-Barrera and Yohai (2006), that is, 
Eq. (16.13), and look at a univariate regression (p = 2), so we only have a constant $\theta_1$ and a slope 
$\theta_2$ to estimate. We set M to 90, so the true $\theta_2$ should be around zero, and the wrong value will be 
0.9. The following figure shows the objective function for different values of $\theta_1$ and $\theta_2$. Where the 
slope has a value of 0.9, we see a sharp “valley” in the surface, seen more clearly in the right panel. 


[Figure: objective function plotted against Constant and Slope (left panel) and against Slope (right panel).] 


The trouble is that this optimum is actually lower than the region associated with the true 
parameters. 

This can be seen in Fig. 16.18. The graphs show the result of 500 runs of LMS and LTS for the 
univariate regression, but with varying computational resources. Thus, the solutions differ widely in 
their realized objective function values. In the left panels, we plot all solutions against the estimate 
$\theta_2$; in the right panels, we “zoom in” on the 100 best solutions. Clearly, getting better in terms of the 
objective function does not necessarily help in finding the correct parameter value. In Fig. 16.19, 
we repeat the whole procedure but with M = 150. Hence, the correct $\theta_2$ is still zero, the wrong 
value is 1.5. We see a more satisfactory behavior: with a lower objective function, we essentially 
always find the correct parameter value. 
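
The sharp valley can be reproduced with a few lines of code. The sketch below is not the code behind the figures: the sample size, the choice of the order statistic h, and fixing the constant at zero are assumptions made to keep the example short.

```r
## LMS-type objective along the slope for contaminated data (a sketch).
set.seed(2)
n <- 400; ol <- 40; M <- 90
x <- rnorm(n); y <- rnorm(n)
x[1:ol] <- 100; y[1:ol] <- M             # contaminated rows [M 1 100]
h <- floor(n / 2) + 1                    # hypothetical choice of order statistic

lms <- function(theta)
    sort((y - theta[1] - theta[2] * x)^2, partial = h)[h]

slopes <- seq(-2, 2, by = 0.05)
of <- sapply(slopes, function(s) lms(c(0, s)))   # constant fixed at zero

i <- which.min(abs(slopes - M / 100))    # grid point at the "wrong" slope 0.9
of[c(i - 1, i, i + 1)]                   # the valley is only one grid point wide
```

Only when the slope is very close to M/100 do the outliers have residuals near zero; one grid step away their squared residuals are already large, which explains why the valley is so narrow.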


16.2.5 Final remarks 


In this section we have described how optimization heuristics can be used for robust regression, in 
particular, LMS regression. Both PSO and DE seem capable of giving “good” solutions to the LMS 
problem, even though the computational resources (i.e., number of function evaluations) would 
have to be increased drastically to make the distribution of outcomes collapse to a very narrow 
support. In other words, there always remains randomness in the solutions. It is difficult to judge 
the importance of this remaining randomness without a particular application. 

DE performed well for small models, but the obtained results were sensitive to the specific 
parameter settings once we estimated models with more coefficients, that is, once we went into 
higher dimensions. Properly tuned, DE often found very good solutions, also for larger models. 
But that does not change the fact that results were sensitive to “improper” parameter settings. PSO 
showed a more robust performance. Given this robustness, we may prefer PSO for this particular 
problem. But there are several points to be kept in mind. First, all results are conditional on the 
chosen setup. Artificial settings as here are helpful to check algorithms since we know the true 
solution and we can also easily scale the size of problems. Yet if a model works well on artificial 
data, this may not be much proof that it works with real data (of course, if a technique fails in such 
experiments, we have a clear indication about its quality). 

FIGURE 16.18 True $\theta_2$ is zero. The outliers lead to a biased value of 0.9. 

FIGURE 16.19 True $\theta_2$ is zero. The outliers lead to a biased value of 1.5. 

Furthermore, while PSO performed well on average, some restarts returned low-quality 
solutions. It is difficult to judge the relevance of such outcomes: the errors that may occur from 
the optimization have to be weighed in light of the actual application, for example, a portfolio 
construction process. A suggestion for actual implementations is thus to “diversify,” that is, to 
implement several methods for a given problem, at least as benchmarks or test cases. At the very least, 
the algorithm should be restarted several times. 


16.3 Estimating Time Series Models 
16.3.1 Adventures with time series estimation 


Econometric software has been scrutinized repeatedly in the past. Notably, Newbold et al. (1994) 
published their experiences with software for autoregressive integrated moving average models 
(ARIMA) under the suitable title Adventures with ARIMA software. While comparing several then 
state-of-the-art packages in the estimation of a GARCH(1, 1) model, Brooks et al. (2001) found 
substantial differences between the outputs of the packages, concerning both parameter values and 
their standard deviations (and, hence, their significance). McCullough and Vinod (1999) is just one 
of several papers where these co-authors take a closer look at all sorts of packages for data analysis. 
The typical conclusion in all of these contributions is: do not blindly trust the results that software 
packages report. 

From a user’s perspective, the usual suspects are coding issues. Despite being programmed 
with the best intentions, and having undergone careful review with respect to the developers’ con- 
siderations, packages can contain the odd bug, and even the most rigorous testing and verification 
procedures can miss it. Bugs can be typos in formulae, inappropriate error and exception handling, 
mixed-up variables, wrong sequences of operations, and many more. But bugs are not that frequent, 
at least in established packages. There must be other reasons why packages report different results 
than one another. Here are some: 


Computer arithmetic. Many estimation routines require matrix inversions, the solving of linear 
systems, or the computation of first derivatives, to all of which the limitations of computer 
arithmetic discussed in the first part of this book apply. These include issues with floats 
(see Section 2.1): there are limits to “fine-tuning” since (a) these values are not genuinely 
continuous but sit on a grid, and (b) small differences in the parameters will not have a 
measurable effect on the objective value as precision gets lost during operations. The last 
digits of the parameters’ internal representation are therefore somewhat arbitrary. 

Choice of numerical methods. Model estimation always involves some form of search and opti- 
mization. It is not uncommon for the estimation package to call some default general-purpose 
optimization routine provided as part of the package. These routines usually use methods 
suitable for convex minimization problems, but not if there are local optima; even quasi- 
convex functions are no longer manageable in such cases, let alone ones with frictions or 
multiple optima. 

Conventions and assumptions. The perils in empirical work cannot be avoided by theory alone. 
Typical examples are missing values: Should the entire observation be dropped if just one 
(perhaps, even unused) variable is NA? What should be used for lags in the first observation 
of a time series model? Are constraints on parameters enforced? Ideally, there are gener- 
ally agreed answers or the package is flexible enough to accommodate modifications. The 
documentation, at least, should provide a description of what it does. 
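
The two limits named under computer arithmetic can be checked directly. The following sketch uses a made-up objective function f; only .Machine$double.eps is part of R itself.

```r
eps <- .Machine$double.eps    # spacing of the floating-point grid around 1

(1 + eps) == 1                # FALSE: the next number on the grid
(1 + eps / 2) == 1            # TRUE: falls between grid points

## A small change in the parameter has no measurable effect on the objective:
f <- function(theta) (theta - 1)^2 + 10
f(1) == f(1 + 1e-9)           # TRUE: the change of about 1e-18 is lost next to 10
```

An optimizer that compares objective function values cannot distinguish between these two parameter values, so which of them it reports is a matter of chance.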


With time series models, all three of these can arise. To get a better understanding, we look 
at one of the work-horse models for financial data, GARCH(1, 1). Admittedly, there are bigger 
challenges out there; but it is simple enough so that we can better see the aspects relevant from 
a numerical and optimization perspective, and we hope that this provides a roadmap for more 
demanding models. 


16.3.2 The case of GARCH models 


Maximizing loglikelihoods 

In some models, like the ones in Section 16.2, it is easy to see if (and why) their estimation is a 
challenge to traditional optimization methods: in univariate regression, the mean squared error is a 
nicely convex function with respect to the constant term and slope, while the least median of squares 
is not. For other models, it is not quite so obvious. This is particularly true when the loglikelihood 
is part of the objective function. If a model $y_t = f(\Omega_t|\psi) + e_t$ with information or regressors $\Omega_t$ 
and parameters $\psi$ has normally distributed residuals, $e_t \sim N(0, \sigma^2)$, then the loglikelihood is 

$$L = -\frac{T}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_t e_t^2 .$$ 

Estimating that model means finding the parameter values $\psi$ that maximize $L$: 

$$\hat{\psi} = \arg\max_{\psi} L(\psi), \qquad L(\psi) = -\frac{T}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_t \bigl(y_t - f(\Omega_t|\psi)\bigr)^2 .$$ 


How difficult this estimation problem is hinges on the properties of the residuals and how they are 
determined. If $f$ assumes a simple linear dependency between the $y_t$ and some exogenous variables 
$x_t$, then finding $\hat{\psi}$ is not much of a problem: In that case, $L$ is quadratic in $\psi$, and the estimation 
requires solving a convex optimization problem. 
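
For the linear case, the loglikelihood can be written down in a few lines. The sketch below uses simulated data and a hypothetical function loglik; it checks the formula against R's normal density and confirms that the least-squares estimate maximizes L for a given variance.

```r
## Gaussian loglikelihood for the linear model y = psi[1] + psi[2] * x + e.
loglik <- function(psi, sigma2, y, x) {
    e <- y - (psi[1] + psi[2] * x)
    n <- length(y)
    -n / 2 * log(2 * pi * sigma2) - sum(e^2) / (2 * sigma2)
}

set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)

## same value as the sum of log densities
ll1 <- loglik(c(1, 2), 1.5, y, x)
ll2 <- sum(dnorm(y, mean = 1 + 2 * x, sd = sqrt(1.5), log = TRUE))
all.equal(ll1, ll2)

## for given sigma2, maximizing L means minimizing the sum of squared residuals
ols <- unname(coef(lm(y ~ x)))
loglik(ols, 1.5, y, x) >= loglik(c(1, 2), 1.5, y, x)
```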

However, it does not take much to make matters more demanding, and time series models are a 
good example for that. In the moving average model (see Section 8.4.2), current observations are 
also affected by past shocks, implying that the residuals are autocorrelated: 

$$y_t = \mu + \sum_{\ell} \theta_\ell e_{t-\ell} + e_t \quad\Rightarrow\quad e_t = y_t - \mu - \sum_{\ell} \theta_\ell e_{t-\ell} .$$ 

Because of its recursive nature, we can repeatedly substitute any lagged residual by the 
corresponding observation of $y$ and the weighted preceding residual(s), and end up with a situation 
where lagged values $y_{t-\ell}$ enter with weights $(-\theta)^\ell$. If we substitute this into the loglikelihood 
function, it becomes apparent that the objective function for the estimation problem is a higher 
order polynomial, and concavity (a crucial assumption in traditional numerical optimizers) is no 
longer granted. 
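
For a single lag, the substitution can be verified numerically. The sketch below assumes an MA(1) model with a pre-sample shock of zero; the function ma1_residuals and the parameter values are hypothetical.

```r
## Residuals of an MA(1) model, computed recursively: e_t = y_t - mu - theta * e_{t-1}
ma1_residuals <- function(y, mu, theta, e0 = 0) {
    e <- numeric(length(y))
    prev <- e0
    for (t in seq_along(y)) {
        e[t] <- y[t] - mu - theta * prev
        prev <- e[t]
    }
    e
}

set.seed(8)
n <- 20; mu <- 0.1; theta <- 0.6
shocks <- rnorm(n)
y <- mu + shocks + theta * c(0, shocks[-n])   # pre-sample shock set to zero

e <- ma1_residuals(y, mu, theta)
all.equal(e, shocks)                          # the recursion recovers the shocks

## the last residual as a weighted sum of all observations, weights (-theta)^l
e_n <- sum((-theta)^(0:(n - 1)) * (rev(y) - mu))
all.equal(e[n], e_n)
```

Since every residual contains powers of theta up to its own time index, the sum of squared residuals is a polynomial of high order in theta.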

Likewise, if the variance is time varying, $e_t \sim N(0, \sigma_t^2)$, then the loglikelihood becomes 

$$L = -\frac{T}{2}\log(2\pi) - \frac{1}{2}\sum_t \left(\log(\sigma_t^2) + \frac{e_t^2}{\sigma_t^2}\right) .$$ 

This is the case for the GARCH model, which was introduced in Section 8.5.2. 


Setting up GARCH models 


GARCH models are a reasonable and popular choice for financial time series. Just as a quick 
reminder: Estimates of the current variance are assumed to be a weighted combination of previous 
variance estimates and the most recent realizations in the guise of previous squared errors, $e_{t-\ell}^2$. In 
its simplest form, the actual time series $\{y_t\}$ has constant mean $\mu$ and no other drivers: 

$$y_t = \mu + e_t, \quad\text{where } e_t \sim N(0, \sigma_t^2) \text{ and } \sigma_t^2 = \alpha_0 + \sum_{\ell=1}^{q} \alpha_\ell e_{t-\ell}^2 + \sum_{\ell=1}^{p} \beta_\ell \sigma_{t-\ell}^2 .$$ 

For the sake of simplicity (and because it seems to be sufficient in many situations), let us stick to 
the GARCH(1, 1) case with just one lag each ($q = p = 1$). 

Even in this very basic incarnation, there are a few things that need to be agreed on. For starters, 
it is not clear how to extract the $e_t$s: are they forced to have a mean of zero or not? Meaning, is 
$\mu$ replaced with the sample mean, $\bar{y}$, and the residuals computed as $e_t = y_t - \bar{y}$, or is $\mu$ another 
parameter to estimate? 

Next, we need to agree on how to initialize the sequence of variances. The recursive definition 
according to the model does not work for the first observation since there is no observation for $e_0$ 
or $\sigma_0$. In the absence of more information, one could use unconditional expected values. But even 
that is not without ambiguities: One could just simply set the first observation to be the 
unconditional variance ($\sigma_1^2 = E(e^2)$). One could also assume that this is the best guess for the unobserved 
preceding values of squared residual and variance ($\sigma_0^2 = e_0^2 = E(e^2)$). Or one could rely on the 
theoretical properties and remember that the unconditional variance (for a sufficiently large sample) 
should be equal to $\alpha_0/(1 - \alpha_1 - \beta_1)$, provided that $\alpha_1 + \beta_1 < 1$. All three alternatives are valid, 
and any mainstream package relies on one of them. But they are not perfect, in particular if the 
first observations fall within a regime that has unusually high or low variance. It is an even bigger 
problem for short time series: there is just not enough time for the system to warm up. So, one 
could use $e^2$ as a manifestation of $\sigma^2$, but also consider estimating another parameter. 

Theory does give us some guidance for parameter values. The $\alpha$s and $\beta$s should be non-negative. 
Otherwise a negative value for $\sigma_t^2$ might result, and an imaginary number for volatility does not 
make much sense. There is also an upper limit on the parameters: the sum of the weights of the 
$e_{t-\ell}^2$s and $\sigma_{t-\ell}^2$s should be less than 1, otherwise $\sigma_t^2$ would go to $\infty$ if we wait long enough. When 
working with large samples these restrictions are usually not a problem, and the constraints are 
satisfied more or less without paying attention. The second constraint (to avoid explosive problems) 
might become relevant, however, when data in the wake of a crisis are analyzed and volatility 
gradually builds up. And for short time series, outliers or other anomalies could also play a role. 
Most packages do not enforce these constraints: that is fine if one is interested in the in-sample 
properties, but less useful for predictions and (long-run) simulations. 
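
The choices discussed so far (the treatment of the mean, the initialization of the variance, and the constraints) all show up when the loglikelihood is coded. The following is a minimal sketch, not the implementation of any particular package. The function name, the ordering of the parameters, the requirement that the constant be positive, and the decision to return -Inf for infeasible parameters are assumptions.

```r
## GARCH(1,1) loglikelihood (a sketch); par = c(mu, alpha0, alpha1, beta1)
garch11_loglik <- function(par, y, init = c("sample", "theoretical")) {
    init <- match.arg(init)
    mu <- par[1]; a0 <- par[2]; a1 <- par[3]; b1 <- par[4]
    ## constraints: non-negative weights, sum of weights below one
    ## (a0 > 0 is an additional assumption to keep the variance positive)
    if (a0 <= 0 || a1 < 0 || b1 < 0 || a1 + b1 >= 1)
        return(-Inf)
    e <- y - mu                        # mu is treated as a parameter
    n <- length(y)
    s2 <- numeric(n)
    s2[1] <- if (init == "sample") mean(e^2) else a0 / (1 - a1 - b1)
    for (t in 2:n)
        s2[t] <- a0 + a1 * e[t - 1]^2 + b1 * s2[t - 1]
    -n / 2 * log(2 * pi) - 0.5 * sum(log(s2) + e^2 / s2)
}

set.seed(10)
y <- rnorm(500, mean = 0.05)           # placeholder data, not a GARCH process

garch11_loglik(c(0.05, 0.1, 0.1, 0.8), y)                        # sample variance
garch11_loglik(c(0.05, 0.1, 0.1, 0.8), y, init = "theoretical")  # a0/(1-a1-b1)
garch11_loglik(c(0.05, 0.1, 0.5, 0.6), y)                        # infeasible: -Inf
```

The two initializations give different loglikelihood values for the same parameters, which is one reason why packages can disagree even when their optimizers work perfectly.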


Numerical aspects 


The model parameters (i.e., the decision variables for the optimization problem) enter the objective 
function in the logarithmic and in the fractional term, and the recursive nature of $\sigma_t^2$ does not simplify 
matters. Next, if the non-negativity constraint on the parameters is enforced by setting negative 
values to zero (i.e., the lowest acceptable value) or by taking absolute values, the objective function will 
have kinks and discontinuities. For candidate solutions where the $\alpha$s and $\beta$s are rather close to zero 
(leading to a sequence $\{\sigma_t^2\}$ where all values are very close to zero, too), $L$ will exhibit surprising 
behavior (again, numerical issues and computer algebra play a role). And if the residuals, too, 
are not just the centered values of $y$ but the result of some regression model, with parameters to be 
estimated simultaneously, the underlying optimization problem turns into an even bigger challenge, 
defying the default optimization routines for strictly concave problems (or their strictly convex 
counterparts). This is why, even for simple settings, some packages relying on the default optimization 
routine (typically gradient based) report their results with a disclaimer such as “false convergence,” 
while others report different results if re-run with an altered sequence of regressors in the regression 
model. 
This is a situation where heuristic methods can help, as the following examples will show. 


Available packages 


GARCH estimations can be performed with all major econometrics software packages and with 
many numerical general-purpose platforms. In fact, for R alone, there exist several packages that 
could be used for GARCH(1, 1); a quick web search will guide you to the most popular ones. 
Experimenting with current packages and comparing their results is highly encouraged, as Brooks et al. 
(2001) had done. We also did, and it turned out that results are not always accurate and they are 
certainly not all equally reliable. In fact, we came across convergence warnings (in particular for 
higher order models), and found noticeable differences in the reported estimates, sometimes 
preventing us from replicating the reported results. We even came across a situation where the sequence 
of the additional explanatory variables for a linear model for y made a difference to the parameter 
estimates even though it obviously should not. However, we would rather refrain from discussing them 
here individually: By the time you read this book, some of the more critical aspects might have been 
removed, and packages might have been updated or replaced with alternatives. All of the packages 
have been implemented with the best intentions, and it is not always clear what the pitfalls are and 
where to expect obstacles. Also, that is not to say that all the packages are flawed or useless; but it 
might be a good idea to double-check the solutions and estimates they report. 


16.3.3 Numerical experiments with Differential Evolution 


Data 


Not least due to Brooks et al. (2001), the time series with 1974 daily changes in the Deutsche Mark
to British Pound exchange rate (DEM2GBP) has become a popular benchmark data set for all sorts 
of time series models. We follow this tradition and derive all results in this section from that same 
data set. 


Setting up the estimation 


GARCH estimation is an optimization problem, and can therefore be approached with heuristic
methods (see Maringer (2005b) and Winker and Maringer (2009)). The objective is to maximize
the loglikelihood (LLH) for given data with respect to the GARCH parameters. We add a slight twist
to the usual problem by also estimating the initial variance, which will serve as the estimate for
$\sigma_{t\le 0}^2$ and $e_{t\le 0}^2$. The vector with decision variables, therefore, is
$\psi = [\mu, \sigma_0^2, \alpha_0, \alpha_1, \ldots, \alpha_q, \beta_1, \ldots, \beta_p]$.
To make the LLH function more generic, llhGARCH accepts params, which is a list of the different
(vectors of) parameters; the function extractGARCHparams creates this list for a given
candidate $\psi$ (i.e., the vectorized sequence of parameters) and order.


Listing 16.1: C-EconometricModels/R/./Ch14/GARCHDEopt.R 


require(NMOF)  # load accompanying package

llhGARCH <- function(y, params) {
    q <- length(params$alpha)
    p <- length(params$beta)
    pqmax <- max(q, p)
    T <- length(y)
    ## presample values are set to s2init
    e2 <- c(rep(params$s2init, pqmax), (y - params$mu)^2)
    s2 <- e2

    tObs <- pqmax + seq(1, T)  # indices of actual observations
    for (t in tObs) {
        ## seq_len() also handles p = 0 or q = 0 (empty lag set)
        s2[t] <- params$alpha0 +
                 sum(params$beta  * s2[t - seq_len(p)]) +
                 sum(params$alpha * e2[t - seq_len(q)])
    }
    LLH <- -log(2*pi) * T/2 - sum(log(s2[tObs]) + e2[tObs]/s2[tObs]) / 2
    return(list(LLH = LLH, s2 = s2[tObs]))
}

extractGARCHparams <- function(psi, order) {
    ## positions of alpha_1, ..., alpha_q and beta_1, ..., beta_p in psi
    P <- if (order[1] > 0) 3 + (1:order[1]) else NULL
    Q <- if (order[2] > 0) max(3, P) + (1:order[2]) else NULL
    params <- list(mu     = psi[1],
                   s2init = psi[2],
                   alpha0 = psi[3],
                   alpha  = psi[P],
                   beta   = psi[Q])
    return(params)
}
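The likelihood recursion can be checked on simulated data before any optimizer is involved. The following is a minimal base-R sketch for the GARCH(1, 1) case only; the function name garch11_llh, the seed, and the parameter values are illustrative assumptions and not part of the book's code. It simulates a series with known parameters and compares the loglikelihood at the true parameters with that of a constant-variance model.

```r
## Sketch under stated assumptions: GARCH(1,1) only, Gaussian innovations,
## presample values e^2_0 and sigma^2_0 both set to s2init.
garch11_llh <- function(y, mu, s2init, alpha0, alpha1, beta1) {
    n  <- length(y)
    e2 <- (y - mu)^2
    s2 <- numeric(n)
    s2_prev <- s2init
    e2_prev <- s2init
    for (t in seq_len(n)) {
        s2[t]   <- alpha0 + alpha1 * e2_prev + beta1 * s2_prev
        s2_prev <- s2[t]
        e2_prev <- e2[t]
    }
    -log(2 * pi) * n / 2 - sum(log(s2) + e2 / s2) / 2
}

## simulate a GARCH(1,1) series with known (hypothetical) parameters
set.seed(42)
n <- 2000
alpha0 <- 0.01; alpha1 <- 0.15; beta1 <- 0.80
uncond <- alpha0 / (1 - alpha1 - beta1)   # unconditional variance
y <- numeric(n)
s2_t <- uncond
e2_t <- uncond
for (t in seq_len(n)) {
    s2_t <- alpha0 + alpha1 * e2_t + beta1 * s2_t
    y[t] <- sqrt(s2_t) * rnorm(1)
    e2_t <- y[t]^2
}

llh_true  <- garch11_llh(y, 0, uncond, alpha0, alpha1, beta1)
llh_const <- garch11_llh(y, 0, uncond, uncond, 0, 0)  # constant variance
c(llh_true = llh_true, llh_const = llh_const)
```

With time-varying volatility in the data, the loglikelihood at the true parameters should be clearly higher than under constant variance; a recursion that does not show this is probably indexed incorrectly.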
A simple repair function can enforce the non-negativity constraint on all parameters, save $\mu$. If
required we can also use it to fix the values of $\mu$ and $\sigma_0^2$; this can be useful when
comparing results generated by other packages or settings; just (un-)comment the corresponding lines
in the code. By convention, the first and second elements of the decision vector will be $\mu$ and
$\sigma_0^2$, so the repair function is as follows:


Listing 16.2: C-EconometricModels/R/./Ch14/GARCHDEopt.R 


repairGARCHparams <- function(psi, data) {
    psi[-1] <- abs(psi[-1])               # non-negativity for all except mu
    psi[1]  <- mean(data$y)               # fix mu with E(y)
    psi[2]  <- mean((data$y - psi[1])^2)  # fix s_0^2 with uncond. var.
    return(psi)
}


Our default function for optimization with Differential Evolution, DEopt, minimizes and requires
an objective function where the first input is a candidate solution, and the second is a list
or dataframe with data and other relevant information. Assume the data are available in a variable
DEM2GBP$y and we are interested in a GARCH(1, 1) model. Then the objective function OF for
the problem at hand and data are


Listing 16.3: C-EconometricModels/R/./Ch14/GARCHDEopt.R 


GARCHorder <- c(1,1)  # (p,q)
data <- list(y = y, order = GARCHorder)

OF <- function(psi, data) {
    params <- extractGARCHparams(psi, data$order)
    LLH <- -llhGARCH(data$y, params)$LLH  # negative LLH, since DEopt minimizes
    return(LLH)
}


Lastly, we need to set up the search algorithm’s parameters and run the search. To be on the safe 
side, we choose a rather generous population size and number of generations. With that in place, 
the search can begin. 


Listing 16.4: C-EconometricModels/R/./Ch14/GARCHDEopt.R 


D <- 3 + GARCHorder[1] + GARCHorder[2]  # number of parameters
algo <- list(nP  = 20L,           ## population size
             nG  = 1000L,         ## number of generations
             F   = 0.7,           ## step size
             CR  = 0.5,           ## prob of crossover
             min = rep(-1, D),    ## range for initial population
             max = rep( 1, D),
             repair = repairGARCHparams,
             minmaxConstr = FALSE,
             printBar = TRUE)
require(NMOF)
sol <- DEopt(OF = OF, algo = algo, data = data)


The loglikelihood of a GARCH(1, 1) is not concave, but it should be reasonably well-behaved, 
so that the heuristic method (in particular with the chosen parametrization) should not face severe 
problems. Nonetheless, repeated runs are recommended—not just to find the optimum, but also 
to learn more about the problem itself. Heuristic optimization is a random process. That makes it 
slower than deterministic methods, yet endows it with two main advantages: (i) It does not always
converge to the same solution, making it unlikely to fall into the same local optimum as before;
eventually, this makes us aware of the existence of multiple optima from which the best, i.e., the
global optimum, can be chosen. (ii) Not all reported results are the same, which tells us something
about the uniqueness and diversity of the possible solutions.

Evaluating the optimal results for the benchmark case


Based on the settings introduced above, we run 100 runs per model on a 64-bit operating system
for the DEM2GBP data set with the default GARCH(1, 1) setting where $\mu$, $\alpha_0$, $\alpha_1$, and
$\beta_1$ are to be estimated. $e_0^2$ and $\sigma_0^2$ are both initialized with the unconditional
sample variance of y, contingent on the candidate’s $\hat\mu$: $s^2 = E((y - \hat\mu)^2)$. The quality of
the result depends on the algorithm’s settings; Table 16.8 reports the median reported LLH and, in
brackets, how often the overall optimum has been found.


TABLE 16.8 Median reported loglikelihoods for 100 runs with different numbers of generations (nG)
and population sizes (nP). (In brackets: number of times the overall optimum has been found.)

nG      nP=10                       nP=20
  63    -1106.98521899487   (0)     -1106.72100030700   (0)
 125    -1106.60793073714   (0)     -1106.60789452831   (0)
 250    -1106.60788104129  (68)     -1106.60788104129  (99)
 500    -1106.60788104129  (95)     -1106.60788104129 (100)
1000    -1106.60788104129  (98)     -1106.60788104129 (100)


With the most generous setting in terms of function evaluations (nP=20 and nG=1000), all 
of the 100 runs resulted in a reported maximum likelihood of —1106.60788104129. Or, put more 
specifically: that is the value according to the chosen precision when saving results into a .csv 
file; when reading it, it is converted to —1106.607881041290056601 as this is the closest value 
that can be represented as a float. This already highlights the first issue: due to the limitations in 
precision in the objective function, it is not possible to get an arbitrarily precise estimate of the 
parameters. In fact, it is not even possible to get unique results: for the 100 results with identical 
loglikelihood, there were not two with exactly the same parameter estimates: precision is lost when 
computing the objective, and beyond a certain position, additional decimal places in the parameters 
just do not matter anymore. On top, there is a tradeoff between certain parameters: a slightly higher
$\beta_1$ can be combined with a lower $\alpha_1$ without changing the fit. Fig. 16.20 illustrates this.
In other words, there are 100 results
that are all different in terms of parameter values (when looked at through a microscope), yet 
indistinguishable in terms of goodness-of-fit. So reporting estimated parameters beyond a certain 
FIGURE 16.20 Reported parameters for the GARCH(1, 1) estimates from different restarts, all with identical (optimal)
loglikelihood of -1106.60788104129.

TABLE 16.9 Estimated GARCH(1, 1) parameters for the DEM2GBP time series and different
model settings. Fixed values in brackets: in models B and C, $\hat\mu = E(y)$; in models A and B,
$s^2 = E((y - \hat\mu)^2)$.

model   $\hat\mu$       $\hat\sigma_0^2$   $\hat\alpha_0$   $\hat\alpha_1$   $\hat\beta_1$   LLH
A       -0.0061904      (0.2211226)        0.0107614        0.1531341        0.8059737       -1106.6078810
B       (-0.0164268)    (0.2210178)        0.0106189        0.1510860        0.8083086       -1107.3381293
C       (-0.0164268)    0.0031964          0.0095918        0.1409065        0.8223335       -1103.6483336
D       -0.0049176      0.0011400          0.0097009        0.1428098        0.8202918       -1102.7243351


precision does not make sense—beyond a certain decimal place, they are just random numbers. This 
fact is easily overlooked when using deterministic optimization routines: Even if they converge to 
the actual (globally) optimal result, they do not report the optimal estimates, but just one that can no 
longer be improved because precision has been exhausted. For that reason, even the same package 
run on a 32-bit and a 64-bit operating system can find different results. The moral is that it is good 
to have the highest possible precision when computing—but not to have a microscopic view of 
the final results. Truncating or rounding to a reasonable precision is fine. From a practical point 
of view, no relevant information is lost, and from a numerical point of view, it might be just noise 
anyway. 
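The precision limit can be made concrete with a few lines of base R. This is a small sketch, assuming IEEE double precision (which R uses on common platforms); the variable names are illustrative. It computes the spacing between neighboring floating-point numbers around the reported loglikelihood and shows that a change below half that spacing cannot be represented.

```r
llh <- -1106.60788104129

## spacing between adjacent doubles near llh (one unit in the last place)
ulp <- 2^(floor(log2(abs(llh))) - 52)

## a perturbation below half a spacing is lost when stored ...
unchanged <- (llh + 1e-13) == llh
## ... while a slightly larger one moves to the neighboring double
changed   <- (llh + 2e-13) != llh

ulp
c(unchanged = unchanged, changed = changed)

## print the stored value with more digits than were written to file
sprintf("%.21f", llh)
```

Any two candidate solutions whose loglikelihoods differ by less than this spacing are indistinguishable to the optimizer, whatever the differences in their parameters.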


Fixed versus optimized parameters 


As mentioned, some packages allow fixing the estimate for $\mu$ or set it, by default, equal to the
average of y; this enforces E(e) = 0. Fewer parameters to estimate is preferable if the quality of
the result does not deteriorate (and it usually makes the search easier). And once we think about
choosing $\hat\mu$ freely, we can do the same thing for the unobserved constituents of $\hat\sigma_t^2$
and initialize $\sigma_{t\le 0}^2$ and $e_{t\le 0}^2$ either with the unconditional variance $s^2$ or with
another value, $\sigma_0^2$, to be estimated. This leaves us with four different models to compare.

We used the same algo settings as in the previous setting and again 100 runs. For any of the 
models, the vast majority (93, 100, 98, and 87, respectively) of reported LLHs were identical to the 
highest LLH for that particular model, while reported parameters varied slightly due to numerical 
limitations discussed above. Table 16.9 reports parameter estimates rounded to a number of digits 
where the differences actually do matter and are no longer numerical artefacts. 

Allowing the model to choose more parameters freely must increase the loglikelihood if the 
fixed value is still available and the more flexible model nests the simpler. If the fixed value really 
is optimal, then it should still be picked and the LLH remains the same; if it isn’t and a better value 
can be found, then this will show in an increased LLH. Model A is the benchmark model from
above; it is no surprise that it has a higher LLH than B where $\hat\mu$ is fixed. Allowing
$\hat\sigma_0^2$ to be chosen has an even bigger positive effect on the LLH: The early observations
obviously exhibit below-average volatility, and the unconditional variance would ignore that. At the
same time, shocks have a slightly lower impact (lower $\hat\alpha_1$) yet with slightly longer memory
(slightly higher $\hat\beta_1$).

The question is: is it worth including additional parameters given the perils of overestimation?
To assess this, information criteria can help. When there are k parameters to estimate,
Akaike’s information criterion, AIC = $-2\mathcal{L} + 2k$, suggests keeping an additional parameter
if the LLH goes up by more than 1. With k = {4, 3, 4, 5} parameters to estimate and AIC =
{2221.216, 2220.676, 2215.297, 2215.449}, model C (fix $\hat\mu = \bar{y}$ but estimate $\sigma_0^2$)
seems to be the most favorable. The Bayesian information criterion, BIC = $-2\mathcal{L} + k\ln(T)$,
is stricter and requires the LLH to go up by more than ln(T)/2. With BIC = {2243.567, 2237.440,
2237.648, 2243.388}, model B (fix both $\hat\mu$ and $\hat\sigma_0^2$) seems to be marginally better
than model C.
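The information criteria quoted above follow directly from the loglikelihoods in Table 16.9. A short base-R sketch (the vector names are illustrative; T = 1974 is the number of observations stated for the DEM2GBP series):

```r
## loglikelihoods and numbers of free parameters for models A to D
llh  <- c(A = -1106.6078810, B = -1107.3381293,
          C = -1103.6483336, D = -1102.7243351)
k    <- c(A = 4, B = 3, C = 4, D = 5)
nobs <- 1974

aic <- -2 * llh + 2 * k
bic <- -2 * llh + k * log(nobs)

round(rbind(AIC = aic, BIC = bic), 3)
names(which.min(aic))  # model preferred by AIC
names(which.min(bic))  # model preferred by BIC
```

The smallest AIC belongs to model C and the smallest BIC to model B, in line with the text.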


Further applications in time series modeling 


Obviously, heuristic methods offer themselves to many applications beyond GARCH models. In 
smooth transition autoregressive (STAR) models (see Teräsvirta (1994)), the idea is to distinguish 


5. See also footnote 8 on page 166. 
between two different regimes, each possessing their own set of parameters, but have a linear
combination of the two that allows for a smooth switch or transition between them that is governed by
some transition variable s. One representation of this model is

$$y_t = \sum_{\ell \in \mathcal{L}_1} \phi_{1,\ell}\, y_{t-\ell} + G(s_t \mid c, \gamma) \sum_{\ell \in \mathcal{L}_2} \phi_{2,\ell}\, y_{t-\ell} + e_t$$

with $G(s_t \mid c, \gamma) = (1 + \exp(-\gamma (s_t - c)))^{-1}$. No closed form solution exists to find c
and $\gamma$, so the literature suggests grid search. In addition, finding parsimonious lag sets
$\mathcal{L}_1$ and $\mathcal{L}_2$ is desirable. For this a common approach is to estimate the model up to
a maximum lag, and then repeatedly exclude lags with insignificant parameters until only significant
ones are left. Alternatively, the adjusted root
mean squared error or a suitable information criterion could be optimized. Maringer and Meyer 
(2008) apply simulated annealing, threshold acceptance, and differential evolution to this problem. 
All methods are well suited to estimate y and c, but in particular, differential evolution seems well 
suited for uncovering suitable lag structures, too. Chen and Maringer (2011) and more recently 
Aussenegg et al. (2018) build on these findings and more general versions of this model to inves- 
tigate excess bond returns. The problem of lag selection and finding parsimonious structures has 
also been investigated in Winker and Maringer (2005) and Maringer and Deininger (2016) who use 
threshold acceptance and differential evolution, respectively, in the context of vector autoregressive 
(VAR) and vector error correction (VEC) models. 
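To make the role of the transition function in the STAR model concrete, here is a minimal base-R sketch. The function names G and star_mean and all numerical values are illustrative assumptions; the sketch only evaluates the conditional mean for given parameters and does not estimate anything.

```r
## logistic transition function G(s | c, gamma)
G <- function(s, c, gamma) 1 / (1 + exp(-gamma * (s - c)))

## conditional mean of y_t, given the lagged values that belong to the
## two lag sets and the corresponding coefficient vectors
star_mean <- function(ylags1, ylags2, phi1, phi2, s, c, gamma)
    sum(phi1 * ylags1) + G(s, c, gamma) * sum(phi2 * ylags2)

## at s = c both regimes are weighted equally in the sense G = 1/2;
## a large gamma makes the transition close to a step function
G(0.3, c = 0.3, gamma = 5)
G(c(-1, 1), c = 0, gamma = 100)

## hypothetical example: lags 1 and 2 in regime one, lag 1 in regime two
star_mean(ylags1 = c(0.5, -0.2), ylags2 = 0.5,
          phi1 = c(0.6, 0.1), phi2 = -0.4,
          s = 0.5, c = 0, gamma = 2)
```

For fixed c and $\gamma$ the model is linear in the $\phi$ coefficients, which is why a search over (c, $\gamma$), by grid or by a heuristic, can be combined with least squares for the remaining parameters.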


Appendix 16.A Maximizing the Sharpe ratio 


Assume there are p assets, with expected excess returns (over the risk-free rate) collected in a vector
$\bar{x}$. The variance–covariance matrix of the assets’ returns is $\Omega$. Maximizing the ratio of
excess return to portfolio volatility can be formalized as

$$\max_{\theta} \frac{\theta'\bar{x}}{\sqrt{\theta'\Omega\theta}} .$$

The first-order conditions of this problem lead to the system of linear equations

$$\bar{x} = \Omega\theta ;$$

see, for instance, Cuthbertson and Nitzsche (2005, Chapter 6). Solving the system and rescaling
$\theta$ to sum to unity gives the optimal weights.

Assume now that we have n observations; we define $\bar{x}$ to be the sample mean, and collect
the p return series in a matrix X of size n × p. For the regression representation as proposed in
Britten-Jones (1999), we need to solve

$$\iota = X\theta^*$$

(which is an LS problem), where $\iota$ is the unit vector (a vector of ones) and the superscript *
only serves to differentiate between $\theta$ and $\theta^*$.
This can be rewritten as

$$\frac{1}{n} X'\iota = \frac{1}{n} X'X\theta^* ,$$
$$\bar{x} = \frac{1}{n} X'X\theta^* ,$$
$$\bar{x} = (\Omega + \bar{x}\bar{x}')\theta^* .$$

Applying the Sherman–Morrison formula (Golub and Van Loan, 1989, Chapter 2) allows us to
show that $\theta^*$ will be proportional to $\theta$, and hence after rescaling we have
$\theta^* = \theta$.
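The equivalence can be checked numerically. The following base-R sketch uses simulated excess returns (the seed, dimensions, and distribution are arbitrary assumptions) and takes $\Omega$ with divisor n, so that $\frac{1}{n}X'X = \Omega + \bar{x}\bar{x}'$ holds exactly. It compares the solution of the first-order conditions with the regression of a vector of ones on X.

```r
set.seed(1)
n <- 500; p <- 4
X <- matrix(rnorm(n * p, mean = 0.01, sd = 0.05), nrow = n, ncol = p)

xbar  <- colMeans(X)
Omega <- crossprod(sweep(X, 2, xbar)) / n   # covariance matrix, divisor n

theta      <- solve(Omega, xbar)            # first-order conditions
theta_star <- qr.solve(X, rep(1, n))        # least squares: iota = X theta*

## rescale both to sum to unity
w      <- theta / sum(theta)
w_star <- theta_star / sum(theta_star)
max(abs(w - w_star))

## Sherman-Morrison gives the factor of proportionality explicitly
scale <- 1 / (1 + sum(xbar * theta))
max(abs(theta_star - scale * theta))
```

Both differences should be zero up to rounding error; with a covariance estimate that uses divisor n - 1 the weights still coincide after rescaling, only the factor of proportionality changes.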
Calibrating option pricing models
Option pricing models represent the price of the option as a function of the underlier; this
underlier is then usually modeled via stochastic differential equations. Such differential equa- 
tions require fixing certain parameters, for instance, the volatility of the underlier. Choosing 
such parameters is usually called calibration. There exist different strategies to calibrate a 
model: 


• We are given observed quantities, and try to find model parameters such that model output equals
these quantities. This is the problem we will deal with in this chapter. This is useful when
we use models to interpolate (and extrapolate) market prices: we start with a set of observed
market prices and then set the parameters such that the model prices are close to these actual
prices.

• There are other strategies: the idea in option pricing is to model the underlier; then the option
price is a function of this underlier. Thus calibration as described before means to “reverse-engineer”
the market’s implied view about the underlier, forced into the dynamics (or just the
language) of the model. But even if this model were a useful approximation of the world, we
may find that option prices in the market do not reflect our own view of the world. Thus, we may
also calibrate option-pricing models to the actual time series of the underlier, or a model of this
underlier.


Example 17.1 
The forward price F of an asset that does not pay any dividends is related to the spot price S by

$$F = S e^{r\tau} .$$

r is the discount rate, and $\tau$ the time to maturity of the contract. Assume the asset is an equity
index that does not incorporate dividends (like the S&P 500, the EURO STOXX 50, or the Nikkei 225).
Then we often assume that the dividend is paid continuously with rate q, thus we get
$F = S e^{(r-q)\tau}$. Suppose we can observe S and a futures price F, we know $\tau$, and assume we have
a reasonable idea about r. Hence q, the implied yield, should be

$$q = \frac{\log(S) - \log(F) + r\tau}{\tau} .$$


For futures on the same underlier, but different times to expiration, the implied yield q need not be the 
same. For instance, the EURO STOXX 50 is a basket of 50 underliers, and many of them pay dividends 
in the second quarter of the year. So we usually find a higher (annualized) dividend yield for contracts 
that cover that period. 
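The computation in Example 17.1 takes one line of R. In this sketch the function name implied_yield and the numbers are hypothetical; a futures price is first constructed from a known yield, which the formula should then recover.

```r
## implied (continuous) dividend yield from spot and futures price
implied_yield <- function(S, Fut, r, tau)
    (log(S) - log(Fut) + r * tau) / tau

S <- 100; r <- 0.03; tau <- 0.5
q_true <- 0.02
Fut <- S * exp((r - q_true) * tau)   # futures price consistent with q_true

implied_yield(S, Fut, r, tau)        # recovers q_true up to rounding
```

With market data, the result inherits any error in the assumed discount rate r one-for-one, which is worth keeping in mind when comparing yields across maturities.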
Example 17.2


Assume we have prices of a European put and a European call option for a given strike X and a given
time to maturity $\tau$. Let the underlier pay one or several dividends during the lifetime of these
options, and let the present value of the dividends be D. Then put–call parity links these together as

$$\text{Call} + e^{-r\tau} X = \text{Put} + S - D .$$ (17.1)

If we model dividends as a rate q, put–call parity becomes

$$\text{Call} + e^{-r\tau} X = \text{Put} + e^{-q\tau} S .$$ (17.2)


We can solve for D or q and so obtain dividends/dividend yields. The trouble is that with bid-ask 
spreads a range of dividends is feasible. 
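A small base-R sketch illustrates how a bid–ask spread turns the implied dividend yield into a range. The option prices are generated from the BS formula with a known yield and then widened by a hypothetical half-spread; the function names bs and implied_q and all numbers are assumptions made for illustration.

```r
## BS price; I = 1 for a call, I = -1 for a put; v is the variance
bs <- function(S, X, tau, r, q, v, I = 1) {
    d1 <- (log(S / X) + (r - q + v / 2) * tau) / sqrt(v * tau)
    d2 <- d1 - sqrt(v * tau)
    I * (S * exp(-q * tau) * pnorm(I * d1) -
         X * exp(-r * tau) * pnorm(I * d2))
}

## solve put-call parity (17.2) for q
implied_q <- function(call, put, S, X, tau, r)
    -log((call - put + exp(-r * tau) * X) / S) / tau

S <- 100; X <- 100; tau <- 1; r <- 0.03; q <- 0.02; v <- 0.2^2
call_mid <- bs(S, X, tau, r, q, v,  1)
put_mid  <- bs(S, X, tau, r, q, v, -1)
half <- 0.05                         # hypothetical half-spread

q_mid  <- implied_q(call_mid, put_mid, S, X, tau, r)
q_low  <- implied_q(call_mid + half, put_mid - half, S, X, tau, r)
q_high <- implied_q(call_mid - half, put_mid + half, S, X, tau, r)
c(q_low = q_low, q_mid = q_mid, q_high = q_high)
```

The mid prices return the yield used to generate them; combining call ask with put bid (and vice versa) gives the bounds of the feasible range.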


17.1 Implied volatility with Black-Scholes 


In a Black-Scholes (BS) world, the stock price $S_t$ under the risk-neutral measure follows the
stochastic differential equation

$$dS_t = (r - q) S_t\,dt + \sqrt{v}\, S_t\,dz_t$$ (17.3)

where z is a Wiener process (Black and Scholes, 1973). Note that we use the subscript t to denote a
point in time (usually current time). The volatility $\sqrt{v}$ does not have a subscript; it is
constant. The well-known pricing formula for the call under this model is given by

$$C_0 = e^{-q\tau} S_0 N(d_1) - X e^{-r\tau} N(d_2)$$ (17.4)

with

$$d_1 = \frac{1}{\sqrt{v\tau}} \left( \log\left(\frac{S_0}{X}\right) + \left(r - q + \frac{v}{2}\right)\tau \right)$$ (17.5a)

$$d_2 = \frac{1}{\sqrt{v\tau}} \left( \log\left(\frac{S_0}{X}\right) + \left(r - q - \frac{v}{2}\right)\tau \right) = d_1 - \sqrt{v\tau}$$ (17.5b)

and N(·) the Gaussian distribution function. For the put, we can “flip around” Eq. (17.4), that is,
we multiply by -1, and replace $d_1$ and $d_2$ by $-d_1$ and $-d_2$. We can also use put–call parity,
see Eqs. (17.1) and (17.2). The latter approach is convenient because it holds for any European call
option, no matter how we computed the price. So we only need to price the call. The BS model is quickly
coded with MATLAB®:


Listing 17.1: C-OptionCalibration/M/./Ch15/callBSM.m


function C0 = callBSM(S,X,tau,r,q,v)
% callBSM.m -- version 2010-10-26
% S   = spot
% X   = strike
% tau = time to mat
% r   = riskfree rate
% q   = dividend yield
% v   = variance (vol squared)
d1 = (log(S./X) + (r - q + v/2) .* tau) ./ (sqrt(v * tau));
d2 = d1 - sqrt(v*tau);
C0 = S.*exp(-q.*tau).*normcdf(d1,0,1) - ...
     X.*exp(-r.*tau).*normcdf(d2,0,1);


In R the function pnorm is the equivalent to MATLAB’s normcdf. 
> callBSM <- function(S, X, tau, r, q, v, I = 1) {
+     ## I .. 1 for a call, -1 for a put
+     d1 <- (log(S/X) + (r - q + v/2) * tau) /
+         sqrt(v * tau)
+     d2 <- d1 - sqrt(v * tau)
+     I * (S * exp(-q * tau) * pnorm(I * d1) -
+          X * exp(-r * tau) * pnorm(I * d2))
+ }


The NMOF package contains this computation in the function vanillaOptionEuropean. 
The BS pricing function is also implemented in a number of other R packages. We have always liked 
RQuantLib (Eddelbuettel et al., 2018); see ?EuropeanOption after attaching the package.

Note that we have already given a pricing function for BS in Chapter 4; here we write the models 
in terms of variance v (not volatility). This is a convention chosen to be consistent with models that 
we describe later. 

Suppose we fix all arguments (S, r, ...) except the volatility, so let us write $C(\sqrt{v})$ for the
price of a plain vanilla call. Under BS, this price is a monotonically increasing function of
volatility; the same holds for the put. Thus, letting $C_{\text{market}}$ be the observed price of the
call, the difference

$$C(\sqrt{v}) - C_{\text{market}}$$ (17.6)

will have exactly one zero. (This is not generally true for other types of options, notably not for
barrier options.)

We can use any of the zero-finding methods described in Chapter 11 to locate the zero of the
difference (17.6). Newton’s method seems particularly attractive. There is an inflection point where
the option’s vega is maximized; see Manaster and Koehler (1982):

$$\sqrt{v} = \sqrt{\left| \log\left(\frac{S}{X}\right) + (r - q)\tau \right| \frac{2}{\tau}} ;$$ (17.7)

this point can be used as a starting value for Newton’s method. We can directly use the function
Newton0 given in Chapter 11.¹


Listing 17.2: C-OptionCalibration/M/./Ch15/computeIV.m


function iv = computeIV(S,X,tau,r,q,start,C)
% computeIV.m -- version 2010-10-24
% x is volatility; start is x0
diffF = @(x) callBSM(S,X,tau,r,q,x^2)-C;
iv = Newton0(diffF,start);


A test: we fix a volatility, price an option with it, and then try to recover this implied volatility. 


Listing 17.3: C-OptionCalibration/M/./Ch15/exampleIV.m


% exampleIV.m -- version 2010-10-26
S = 110;     % spot
X = 80;      % strike
r = 0.09;    % interest rate
q = 0.00;    % dividend
tau = 1;     % time to maturity

% compute a market price
trueVol = 0.3;
C = callBSM(S,X,tau,r,q,trueVol^2);

% ... and try to get trueVol back
start = sqrt(abs(log(S/X)+(r-q)*tau)*2/tau);
computeIV(S,X,tau,r,q,start,C)


1. The NMOF package contains a function vanillaOptionImpliedVol that computes implied volatility for options
of both European and American type.


The result is as expected. 


ans = 
0.3000 


A few remarks: 


• Newton’s method is usually very fast; we should be suspicious if it takes more than, say, 10 steps
to converge. This may happen, however, for deep in-the-money or out-of-the-money options, and
also for options with a long time until expiry.

• There is no restriction on how to compute the price. We may use analytical formulas, but other
methods will work as well, for instance, a binomial tree.

• The same holds for the derivative: we have used a finite difference to compute the vega, but we
can use the analytic expression as well. Likewise, we could have used another method to find the
zero, like bisection (to do so, exchange the function Newton0 in the code above with another
function). Approximating the derivative is accurate enough, and in most cases Newton’s method
is going to be faster than bisection. Of course, if we price by MC, we need to make allowances
for randomness; see the discussion on Greeks, page 208.

• The procedure is the same for European and American options.
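The same experiment can be sketched in R. The code below is not the book's Newton0; it is a minimal Newton iteration that uses the analytic vega instead of a finite difference and starts at the point of Eq. (17.7). The function names callBSM_var, vegaBSM, and impliedVol, the tolerance, and the iteration limit are assumptions made for this illustration.

```r
## BS call price in terms of the variance v
callBSM_var <- function(S, X, tau, r, q, v) {
    d1 <- (log(S / X) + (r - q + v / 2) * tau) / sqrt(v * tau)
    d2 <- d1 - sqrt(v * tau)
    S * exp(-q * tau) * pnorm(d1) - X * exp(-r * tau) * pnorm(d2)
}

## derivative of the call price with respect to volatility sqrt(v)
vegaBSM <- function(S, X, tau, r, q, v) {
    d1 <- (log(S / X) + (r - q + v / 2) * tau) / sqrt(v * tau)
    S * exp(-q * tau) * dnorm(d1) * sqrt(tau)
}

impliedVol <- function(price, S, X, tau, r, q, tol = 1e-10, maxit = 50) {
    ## starting value as in Eq. (17.7); assumes it is not exactly zero
    vol <- sqrt(abs(log(S / X) + (r - q) * tau) * 2 / tau)
    for (i in seq_len(maxit)) {
        step <- (callBSM_var(S, X, tau, r, q, vol^2) - price) /
                vegaBSM(S, X, tau, r, q, vol^2)
        vol <- vol - step
        if (abs(step) < tol)
            break
    }
    vol
}

S <- 110; X <- 80; r <- 0.09; q <- 0; tau <- 1
price <- callBSM_var(S, X, tau, r, q, 0.3^2)
iv <- impliedVol(price, S, X, tau, r, q)
iv
```

Swapping the loop for a call to uniroot on the price difference gives a bracketing alternative, which is slower but does not need a derivative.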


The smile 


BS is the standard option pricing model. Practitioners may not use it as it was intended, but it has 
become their language of choice. No other model will achieve this. In some products (for instance, 
currencies), option prices are actually quoted in implied volatility. The success of the BS model 
stems not so much from its empirical quality, but from its simplicity and—important for us—its 
computational convenience. This convenience comes in two flavors. First, we have closed-form 
pricing equations. True, the Gaussian distribution function is not available analytically, but fast 
and precise approximations exist. Second, calibrating the model requires only one parameter to be 
determined, the volatility. We have seen that this can be readily computed from market prices with 
Newton’s method or another zero-finding technique. So what is the trouble with BS? 

It turns out that implied volatilities obtained by inverting the BS model vary systematically 
with strike and maturity. This relationship is called the volatility surface. For a given maturity, we 
generally have a curved shape: the smile. Fig. 17.1 shows the implied volatilities of options as 
of the end of October 2010. Different strategies are possible for incorporating this surface into a 
model. We can accept that volatility is not constant across strikes and maturities, and directly model 
the volatility surface and its evolution. This is what most people in practice do. This assumes that 
a single underlier has different volatilities which does not make an internally consistent model. 
Nevertheless, model consistency is only a desirable byproduct, never the goal. 

An alternative is to model the option prices such that the BS volatility surface is obtained, for 
instance by including locally varying volatility (Derman and Kani, 1994, Dupire, 1994), jumps, 


FIGURE 17.1 Implied volatilities for options on the S&P 500 (left) and DAX (right) as of 28 October 2010.
(Axes: strike, time to maturity in months, implied volatility in %.)
or by making volatility stochastic. In this chapter, we look into models that follow the latter two
approaches, namely the models of Heston (1993) and Bates (1996). We first discuss how to price
options under these models. For this, we describe how to implement an alternative pricing technique
based on the characteristic function of the stock return. Then, we look into calibrating the models.
Fast pricing routines are important since the heuristics that we use are computationally intensive;
hence, to obtain calibration results in a reasonable period of time, we need to be able to evaluate
the objective function (which requires pricing) speedily. Finally, we describe a computational
experiment (from Gilli and Schumann, 2011a) and its results.


17.2 Pricing with the characteristic function 


There are several generic approaches to price options. The essence of BS is a no-arbitrage argument; 
it leads to a partial differential equation that can be solved numerically or, in particular cases, even 
analytically. In this section, we will discuss another pricing technique based on the characteristic 
function of the logarithm of the stock price. We will be brief on the background of this method. 
In fact, there exist variations of this kind of pricing formula; see Carr and Madan (1999), Duffie 
et al. (2000), Lewis (2000), Chourdakis (2005), Fang and Oosterlee (2008). We have chosen the 
formulation of Bakshi and Madan (2000) because it is straightforward, and sufficiently fast. A good 
description of this pricing approach is given in the book by Schoutens (2003). 


17.2.1 A pricing equation 
European options can be priced by the following equation (Bakshi and Madan, 2000, Schoutens,
2003):

$$C_0 = e^{-q\tau} S_0 \Pi_1 - e^{-r\tau} X \Pi_2$$ (17.8)

where $C_0$ is the call price today (time t = 0), $S_0$ is the spot price of the underlier, and X is the
strike price; r and q are the risk-free rate and dividend yield; time to expiration is denoted $\tau$.
The $\Pi_j$ are calculated as

$$\Pi_1 = \frac{1}{2} + \frac{1}{\pi} \int_0^\infty \text{Re}\left( \frac{e^{-i\omega \log(X)}\, \phi(\omega - i)}{i\omega\, \phi(-i)} \right) d\omega ,$$ (17.9a)

$$\Pi_2 = \frac{1}{2} + \frac{1}{\pi} \int_0^\infty \text{Re}\left( \frac{e^{-i\omega \log(X)}\, \phi(\omega)}{i\omega} \right) d\omega .$$ (17.9b)

We define $\Pi_j^* = \pi(\Pi_j - 1/2)$ for the integrals in these equations. The symbol $\phi$ stands for
the characteristic function of the log stock price; the function Re(·) returns the real part of a
complex number. For a given $\phi$ we can compute $\Pi_1$ and $\Pi_2$ by numerical integration and, hence,
obtain prices for the call from Eq. (17.8). For the put, we can use put–call parity.


The Black-Scholes model 
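Given the dynamics of S, the log price $s_\tau = \log(S_\tau)$ follows a Gaussian distribution with
$s_\tau \sim N\left(s_0 + \tau(r - q - \frac{1}{2}v),\ \tau v\right)$, where $s_0$ is the natural logarithm
of the current spot price. The characteristic function of $s_\tau$ is given by

$$\phi_{BS}(\omega) = E\left(e^{i\omega s_\tau}\right)$$
$$= \exp\left( i\omega s_0 + i\omega\tau\left(r - q - \frac{1}{2}v\right) - \frac{1}{2}\omega^2 \tau v \right)$$
$$= \exp\left( i\omega s_0 + i\omega\tau(r - q) - \frac{1}{2}\left(i\omega + \omega^2\right)\tau v \right) .$$ (17.10)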
Note that in the literature the characteristic function is sometimes given for the log-return
$\log(S_\tau/S_0)$, sometimes for the log price $s_\tau = \log S_\tau$. Since

$$\log\left(\frac{S_\tau}{S_0}\right) = s_\tau - s_0$$

we have

$$E\left(e^{i\omega s_\tau}\right) = e^{i\omega s_0}\, E\left(e^{i\omega \log(S_\tau/S_0)}\right) ,$$

so the two versions differ only by the factor $e^{i\omega s_0}$.

Inserting (17.10) into Eq. (17.8) should, up to numerical precision, give the same result as
Eq. (17.4). For MATLAB, code for the characteristic function looks as follows.²

Listing 17.4: C-OptionCalibration/M/./Ch15/cfBSMGeneric.m


function cf = cfBSMGeneric(om,S,tau,r,q,param)
% cfBSMGeneric.m -- version 2010-10-24
vT = param(1);
cf = exp(1i * om * log(S) + 1i * tau * (r - q) * ...
     om - 0.5 * tau * vT * (1i * om + om .^ 2));


Note that we directly pass $\omega$ (om), S, $\tau$, r, and q, and collect all other parameters (in this
case, just the variance v) into a vector param. Also, it is good practice to use 1i when we want to
use the imaginary unit i. The variable i (and also j) is defined as 0 + 1.0000i when we start
MATLAB, but it is also a prime candidate for a loop counter variable and may often be overwritten.
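The claim that Eq. (17.8) with the characteristic function (17.10) reproduces the closed-form price (17.4) can be tried out in R. The following sketch uses base R's integrate for Eqs. (17.9a) and (17.9b); the function names cf_bsm, call_cf, and call_bsm and the parameter values are assumptions for this illustration and are not the NMOF implementations.

```r
## characteristic function (17.10) of the log price under BS
cf_bsm <- function(om, S, tau, r, q, v)
    exp(1i * om * log(S) + 1i * tau * (r - q) * om -
        0.5 * tau * v * (1i * om + om^2))

## call price via Eq. (17.8); cf is any characteristic function with
## arguments (om, S, tau, r, q, ...)
call_cf <- function(cf, S, X, tau, r, q, ...) {
    P1 <- function(om)
        Re(exp(-1i * log(X) * om) * cf(om - 1i, S, tau, r, q, ...) /
           (1i * om * cf(-1i, S, tau, r, q, ...)))
    P2 <- function(om)
        Re(exp(-1i * log(X) * om) * cf(om, S, tau, r, q, ...) /
           (1i * om))
    Pi1 <- 0.5 + integrate(P1, lower = 0, upper = Inf)$value / pi
    Pi2 <- 0.5 + integrate(P2, lower = 0, upper = Inf)$value / pi
    exp(-q * tau) * S * Pi1 - exp(-r * tau) * X * Pi2
}

## closed-form price (17.4) for comparison
call_bsm <- function(S, X, tau, r, q, v) {
    d1 <- (log(S / X) + (r - q + v / 2) * tau) / sqrt(v * tau)
    d2 <- d1 - sqrt(v * tau)
    S * exp(-q * tau) * pnorm(d1) - X * exp(-r * tau) * pnorm(d2)
}

S <- 100; X <- 100; tau <- 1; r <- 0.03; q <- 0.01; v <- 0.2^2
price_cf <- call_cf(cf_bsm, S, X, tau, r, q, v = v)
price_bs <- call_bsm(S, X, tau, r, q, v)
c(cf = price_cf, closed_form = price_bs)
```

The two prices should agree up to the tolerance of the numerical integration. Because call_cf only needs a function for $\phi$, other models can be priced by passing a different characteristic function.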


Merton's jump-—diffusion model 


Merton (1976) suggested modeling the underlier’s movements as a diffusion with occasional jumps;
thus we have

$$dS_t = (r - q - \lambda\mu_J) S_t\,dt + \sqrt{v}\, S_t\,dz_t + J_t S_t\,dN_t .$$ (17.11)

$N_t$ is a Poisson counting process with intensity $\lambda$; $J_t$ is the random jump size (given that
a jump occurred). In Merton’s model the log-jumps are distributed as

$$\log(1 + J) \sim N\left( \log(1 + \mu_J) - \frac{\sigma_J^2}{2},\ \sigma_J^2 \right) .$$

The pricing formula is the following (Merton, 1976, p. 135):

$$C_0 = \sum_{n=0}^{\infty} \frac{e^{-\lambda'\tau} (\lambda'\tau)^n}{n!}\, C_0'\left(r_n, \sqrt{v_n}\right)$$ (17.12)

where $\lambda' = \lambda(1 + \mu_J)$ and $C_0'$ is the BS formula (17.4), but the prime indicates that
$C_0$ is evaluated at adjusted values of r and v:

$$v_n = v + \frac{n\sigma_J^2}{\tau} , \qquad r_n = r - \lambda\mu_J + \frac{n \log(1 + \mu_J)}{\tau} .$$

The factorial in Eq. (17.12) may easily lead to an overflow (Inf), but it is benign for two reasons.
First, we do not need large numbers for n, a value of about 20 is well sufficient. Second (if we
insist on large n), MATLAB or R will evaluate 1/Inf as zero; hence, the summing will add zeros
for large n. Numerical analysts may also prefer to replace n! by $\exp\left(\sum_{i=1}^{n} \log i\right)$
since this leads to better accuracy for large n. This will not cost us much, but again, for Merton’s
model it is not really needed.


2. For R, see function cfBSM in the NMOF package.
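In R the series (17.12) can be written so that no factorial is computed at all: the weights $e^{-\lambda'\tau}(\lambda'\tau)^n/n!$ are Poisson probabilities and are available through dpois. The sketch below is an illustration under that observation; the function names call_bsm and merton_series, the truncation at N = 20 terms, and the parameter values are assumptions, and vJ denotes the variance $\sigma_J^2$ of the log-jumps.

```r
## BS call price in terms of the variance v (vectorized in r and v)
call_bsm <- function(S, X, tau, r, q, v) {
    d1 <- (log(S / X) + (r - q + v / 2) * tau) / sqrt(v * tau)
    d2 <- d1 - sqrt(v * tau)
    S * exp(-q * tau) * pnorm(d1) - X * exp(-r * tau) * pnorm(d2)
}

## truncated series (17.12); Poisson weights via dpois
merton_series <- function(S, X, tau, r, q, v, lambda, muJ, vJ, N = 20) {
    n <- 0:N
    lambda2 <- lambda * (1 + muJ)
    vn <- v + n * vJ / tau
    rn <- r - lambda * muJ + n * log(1 + muJ) / tau
    sum(dpois(n, lambda2 * tau) * call_bsm(S, X, tau, rn, q, vn))
}

S <- 100; X <- 100; tau <- 1; r <- 0.03; q <- 0.01; v <- 0.2^2

## without jumps the series collapses to the BS price
no_jumps <- merton_series(S, X, tau, r, q, v, lambda = 0, muJ = -0.1, vJ = 0.04)
bs_price <- call_bsm(S, X, tau, r, q, v)

## with jumps: 20 terms versus 60 terms
p20 <- merton_series(S, X, tau, r, q, v, lambda = 1, muJ = -0.1, vJ = 0.04)
p60 <- merton_series(S, X, tau, r, q, v, lambda = 1, muJ = -0.1, vJ = 0.04, N = 60)
c(no_jumps = no_jumps, bs = bs_price, p20 = p20, p60 = p60)
```

For these inputs the additional terms beyond n = 20 change nothing visible, which is in line with the remark that about 20 terms are sufficient.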

Depending on the implementation, working with large values for n may lead to a warning or an 
error, and so interrupt a computation. In R, for instance, the handling of such a warning will depend 
on the options setting: 


    > options()$warn
    [1] 0

This is the standard setting. Computing the factorial for a large number will result in a warning; the
computation continues.

    > factorial(200)
    [1] Inf
    Warning message:
    In factorial(200) : value out of range in 'gammafn'

But with warn set to 2, any warning will be transformed into an error. Thus,

    > options(warn = 2)
    > factorial(200)
    Error in factorial(200) :
      (converted from warning) value out of range in 'gammafn'

and our computation breaks. We may want to safeguard against such possible errors. We can, for
instance, replace the function call factorial(n) by its actual calculation, which produces:

    > options(warn = 2)
    > exp(sum(log(1:200)))
    [1] Inf
    > prod(1:200)
    [1] Inf


Or even simpler, as in MATLAB’s implementation of factorial, we can check the given value 
of n; if it is too large, we have it replaced with a more reasonable value. 
The characteristic function of Merton's model is given by

$$ \phi_{\text{Merton}} = e^{A + B} \tag{17.13} $$

where

$$ A = i\omega s_0 + i\omega\tau\left(r - q - \tfrac{1}{2}v - \lambda\mu_J\right) - \tfrac{1}{2}\omega^2 v\tau $$

$$ B = \lambda\tau\left(\exp\left(i\omega\log(1 + \mu_J) - \tfrac{1}{2}i\omega v_J - \tfrac{1}{2}v_J\omega^2\right) - 1\right); $$

see Gatheral (2006, Chapter 5). The A-term in Merton corresponds to the BS dynamics with a drift
adjustment to account for the jumps; the B-term adds the jump component. Like in the BS case, we
can compare the results from Eq. (17.8) with those obtained from Eq. (17.12).

The function cfMertonGeneric codes the characteristic function in MATLAB.³ Note that
the arguments are the same as for BS; param collects the four parameters necessary to specify the
process.


Listing 17.5: C-OptionCalibration/M/./Ch15/cfMertonGeneric.m

    function cf = cfMertonGeneric(om,S,tau,r,q,param)
    % cfMertonGeneric.m -- version 2010-10-24
    % S      = spot
    % tau    = time to mat
    % r      = riskfree rate
    % q      = dividend yield
    % v      = variance (volatility squared)
    % -- jumps --
    % lambda = intensity;
    % muJ    = mean of jumps;
    % vJ     = variance of jumps;
    v      = param(1);
    lambda = param(2);
    muJ    = param(3);
    vJ     = param(4);
    A = 1i*om*log(S) + 1i*om*tau*(r-q-0.5*v-lambda*muJ) ...
        - 0.5*(om.^2)*v*tau;
    B = lambda*tau*(exp(1i*om*log(1+muJ) ...
        - 0.5*1i*om*vJ-0.5*vJ*om.^2) - 1);
    cf = exp(A + B);
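Two quick checks can catch typing errors in such a function: a characteristic function equals one at zero, and under the pricing measure its value at $\omega = -i$ is the forward price $S e^{(r-q)\tau}$. The sketch below re-codes the A- and B-terms as anonymous functions; the parameter values are illustrative assumptions.

```matlab
% Sanity checks for the Merton characteristic function (Eq. 17.13).
% All parameter values are illustrative.
S = 100; tau = 1; r = 0.02; q = 0.08;
v = 0.2^2; lambda = 0.2; muJ = -0.1; vJ = 0.2^2;

A  = @(om) 1i*om*log(S) + 1i*om*tau*(r - q - 0.5*v - lambda*muJ) ...
           - 0.5*(om.^2)*v*tau;
B  = @(om) lambda*tau*(exp(1i*om*log(1 + muJ) ...
           - 0.5*1i*om*vJ - 0.5*vJ*om.^2) - 1);
cf = @(om) exp(A(om) + B(om));

phi0    = cf(0);        % a characteristic function equals 1 at zero
forward = cf(-1i);      % expected price at maturity under the pricing measure
fprintf('cf(0)   = %g\n', real(phi0));
fprintf('cf(-1i) = %.6f, S*exp((r-q)*tau) = %.6f\n', ...
        real(forward), S*exp((r - q)*tau));
```

The second check works because the drift adjustment in A exactly offsets the expected jump contribution in B.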


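The Heston model
Under the Heston (1993) model, the stock price S and its variance v are described by

$$ dS_t = (r - q)\,S_t\,dt + \sqrt{v_t}\,S_t\,dz_t^{(1)} \tag{17.14a} $$
$$ dv_t = \kappa(\theta - v_t)\,dt + \sigma\sqrt{v_t}\,dz_t^{(2)} . \tag{17.14b} $$

The long-run variance is denoted by $\theta$, mean reversion speed is $\kappa$, and $\sigma$ is the
volatility-of-volatility. The Wiener processes $z^{(\cdot)}$ have correlation $\rho$. For $\sigma \to 0$, the Heston dynamics approach
those of BS. A thorough discussion of the model can be found in Gatheral (2006). The characteristic
function of the log price in the Heston model looks as follows; see Albrecher et al. (2007).

$$ \phi_{\text{Heston}} = e^{A + B + C} \tag{17.15} $$

where

$$ A = i\omega s_0 + i\omega(r - q)\tau $$

$$ B = \frac{\theta\kappa}{\sigma^2}\left((\kappa - \rho\sigma i\omega - d)\tau - 2\log\left(\frac{1 - g e^{-d\tau}}{1 - g}\right)\right) $$

$$ C = \frac{v_0}{\sigma^2}\,\frac{(\kappa - \rho\sigma i\omega - d)\left(1 - e^{-d\tau}\right)}{1 - g e^{-d\tau}} $$

$$ d = \sqrt{(\rho\sigma i\omega - \kappa)^2 + \sigma^2(i\omega + \omega^2)} $$

$$ g = \frac{\kappa - \rho\sigma i\omega - d}{\kappa - \rho\sigma i\omega + d} $$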

3. For R, see function cfMerton in the NMOF package. The original pricing technique is implemented in function
callMerton.
4. For R, see function cfHeston in the NMOF package.

Listing 17.6: C-OptionCalibration/M/./Ch15/cfHestonGeneric.m

    function cf = cfHestonGeneric(om,S,tau,r,q,param)
    % cfHestonGeneric.m -- version 2010-10-04
    % S     = spot
    % tau   = time to mat
    % r     = riskfree rate
    % q     = dividend yield
    % v0    = initial variance
    % vT    = long run variance (theta in Heston's paper)
    % rho   = correlation
    % k     = speed of mean reversion (kappa in Heston's paper)
    % sigma = vol of vol

    v0    = param(1);
    vT    = param(2);
    rho   = param(3);
    k     = param(4);
    sigma = param(5);

    d = sqrt( (rho * sigma * 1i*om - k).^2 + sigma^2 * ...
        (1i*om + om .^ 2) );
    g = (k - rho*sigma*1i*om - d) ./ (k - rho*sigma*1i*om + d);
    cf1 = 1i*om .* (log(S) + (r - q) * tau);
    cf2 = vT * k / (sigma^2) * ((k - rho*sigma*1i*om - d) * ...
        tau - 2 * log((1 - g .* exp(-d * tau)) ./ (1 - g)));
    cf3 = v0/sigma^2 * (k - rho*sigma*1i*om - d) .* ...
        (1 - exp(-d*tau)) ./ (1 - g .* exp(-d * tau));
    cf = exp(cf1 + cf2 + cf3);
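The statement that Heston approaches BS for small volatility-of-volatility can be checked numerically. The following sketch evaluates both characteristic functions on a small grid, with initial and long-run variance set equal; the parameter values, the grid, and the choice sigma = 1e-3 are assumptions made for the illustration.

```matlab
% Heston characteristic function with v0 = vT and a small vol-of-vol,
% compared with the BS characteristic function for the same variance.
% All parameter values are illustrative.
S = 100; tau = 1; r = 0.02; q = 0.08;
v0 = 0.2^2; vT = 0.2^2; rho = -0.7; k = 1.0; sigma = 1e-3;
om = linspace(0, 5, 11);

d   = sqrt((rho*sigma*1i*om - k).^2 + sigma^2*(1i*om + om.^2));
g   = (k - rho*sigma*1i*om - d) ./ (k - rho*sigma*1i*om + d);
cf1 = 1i*om .* (log(S) + (r - q)*tau);
cf2 = vT*k/sigma^2 * ((k - rho*sigma*1i*om - d)*tau ...
      - 2*log((1 - g.*exp(-d*tau)) ./ (1 - g)));
cf3 = v0/sigma^2 * (k - rho*sigma*1i*om - d) ...
      .* (1 - exp(-d*tau)) ./ (1 - g.*exp(-d*tau));
cfHeston = exp(cf1 + cf2 + cf3);

cfBSM = exp(1i*om*log(S) + 1i*tau*(r - q)*om ...
        - 0.5*tau*vT*(1i*om + om.^2));

maxDiff = max(abs(cfHeston - cfBSM));
fprintf('largest difference on the grid: %g\n', maxDiff);
```

Very small values of sigma should be used with some care: the terms divided by sigma^2 are differences of nearly equal numbers, so rounding error grows as sigma shrinks.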


The Bates model 


This model, described in Bates (1996), adds jumps to the dynamics of the Heston model. The stock
price S and its variance v are described by

$$ dS_t = (r - q - \lambda\mu_J)\,S_t\,dt + \sqrt{v_t}\,S_t\,dz_t^{(1)} + J_t S_t\,dN_t $$
$$ dv_t = \kappa(\theta - v_t)\,dt + \sigma\sqrt{v_t}\,dz_t^{(2)} . $$

$N_t$ is a Poisson count process with intensity $\lambda$; hence the probability to have a jump of size one is
$\lambda\,dt$. As in Merton's model, the logarithm of the jump size $J_t$ is distributed as a Gaussian, that is,

$$ \log(1 + J) \sim \mathcal{N}\left(\log(1 + \mu_J) - \frac{v_J}{2},\; v_J\right). $$

The characteristic function becomes (Schoutens et al., 2004):⁵

$$ \phi_{\text{Bates}} = e^{A + B + C + D} \tag{17.16} $$

with

$$ A = i\omega s_0 + i\omega(r - q)\tau $$

$$ B = \frac{\theta\kappa}{\sigma^2}\left((\kappa - \rho\sigma i\omega - d)\tau - 2\log\left(\frac{1 - g e^{-d\tau}}{1 - g}\right)\right) $$

$$ C = \frac{v_0}{\sigma^2}\,\frac{(\kappa - \rho\sigma i\omega - d)\left(1 - e^{-d\tau}\right)}{1 - g e^{-d\tau}} $$

$$ D = -\lambda\mu_J i\omega\tau + \lambda\tau\left((1 + \mu_J)^{i\omega}\, e^{\frac{1}{2} v_J i\omega(i\omega - 1)} - 1\right) $$

$$ d = \sqrt{(\rho\sigma i\omega - \kappa)^2 + \sigma^2(i\omega + \omega^2)} $$

$$ g = \frac{\kappa - \rho\sigma i\omega - d}{\kappa - \rho\sigma i\omega + d} $$

Since the jumps are assumed independent, the characteristic function is the product of $\phi_{\text{Heston}}$ and
the function for the jump part (D). We will see below (see Fig. 17.3) that adding jumps
makes it easier to introduce curvature into the volatility surface, at least for short maturities.

5. For R, see function cfBates in the NMOF package.


Listing 17.7: C-OptionCalibration/M/./Ch15/cfBatesGeneric.m

    function cf = cfBatesGeneric(om,S,tau,r,q,param)
    % cfBatesGeneric.m -- version 2010-10-24
    % S      = spot
    % tau    = time to mat
    % r      = riskfree rate
    % q      = dividend yield
    % v0     = initial variance
    % vT     = long run variance (theta in Heston's paper)
    % rho    = correlation
    % k      = speed of mean reversion (kappa in Heston's paper)
    % sigma  = vol of vol
    % -- jumps --
    % lambda = intensity;
    % muJ    = mean of jumps;
    % vJ     = variance of jumps;
    %
    v0     = param(1);
    vT     = param(2);
    rho    = param(3);
    k      = param(4);
    sigma  = param(5);
    %
    lambda = param(6);
    muJ    = param(7);
    vJ     = param(8);

    d = sqrt( (rho * sigma * 1i*om - k).^2 + ...
        sigma^2 * (1i*om + om .^ 2) );
    g = (k - rho*sigma*1i*om - d) ./ (k - rho*sigma*1i*om + d);
    cf1 = 1i*om .* (log(S) + (r - q) * tau);
    cf2 = vT*k / (sigma^2) * ((k - rho*sigma*1i*om - d) * tau - ...
        2 * log((1 - g .* exp(-d * tau)) ./ (1 - g)));
    cf3 = v0/sigma^2 * (k-rho*sigma*1i*om-d) .* (1-exp(-d*tau)) ./ ...
        (1-g.*exp(-d*tau));
    % jump
    cf4 = -lambda*muJ*1i*tau*om + lambda*tau* ...
        ((1+muJ).^(1i*om) .* exp( vJ*(1i*om/2) .* (1i*om-1) )-1);
    cf = exp(cf1 + cf2 + cf3 + cf4);
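Because the jump part enters only through the additive term cf4 in the exponent, its properties can be inspected in isolation. The sketch below (illustrative parameter values, anonymous function written for this example) shows that the term vanishes for zero intensity, which reduces Bates to Heston, and that it is zero at $\omega = -i$, so the jumps leave the forward price unchanged.

```matlab
% Jump part of the Bates characteristic function (the D-term of
% Eq. 17.16). With lambda = 0 it vanishes, so the Bates characteristic
% function collapses to Heston's. Parameter values are illustrative.
tau = 1; muJ = -0.1; vJ = 0.2^2;
om  = linspace(0.1, 10, 50);

cf4 = @(om, lambda) -lambda*muJ*1i*tau*om + lambda*tau * ...
      ((1 + muJ).^(1i*om) .* exp(vJ*(1i*om/2) .* (1i*om - 1)) - 1);

noJumps   = cf4(om, 0);      % exponent contribution without jumps
withJumps = cf4(om, 0.2);    % exponent contribution with jumps
atMinusI  = cf4(-1i, 0.2);   % jump compensation at om = -1i

fprintf('max |D| with lambda = 0:   %g\n', max(abs(noJumps)));
fprintf('max |D| with lambda = 0.2: %g\n', max(abs(withJumps)));
fprintf('D at om = -1i:             %g\n', abs(atMinusI));
```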


17.2.2 Numerical integration 
Pricing with MATLAB’s quad 


Let us start with a straightforward implementation of Eq. (17.8). To compute the integrals, we use 
MATLAB’s quad function. This function uses an adaptive quadrature algorithm with Simpson’s 
rule (this quadrature rule is described below). For a pedagogical description of quad, see Moler 
(2004, Chapter 6). For more details on quad, see Gander and Gautschi (2000).

Black-Scholes
The classic formula was given above. With the characteristic function:
Listing 17.8: C-OptionCalibration/M/./Ch15/callBSMcf.m

    function call = callBSMcf(S,X,tau,r,q,vT)
    % callBSMcf.m -- version 2010-10-24
    % S   = spot
    % X   = strike
    % tau = time to mat
    % r   = riskfree rate
    % q   = dividend yield
    % vT  = variance (volatility squared)
    vP1 = 0.5 + 1/pi * quad(@P1,0,200,1e-14,[],S,X,tau,r,q,vT);
    vP2 = 0.5 + 1/pi * quad(@P2,0,200,1e-14,[],S,X,tau,r,q,vT);
    call = exp(-q * tau) * S * vP1 - exp(-r * tau) * X * vP2;
    end
    %
    function p = P1(om,S,X,tau,r,q,vT)
    p = real(exp(-1i*log(X)*om) .* cfBSM(om-1i,S,tau,r,q,vT) ./ ...
        (1i * om * S * exp((r-q) * tau)));
    end
    %
    function p = P2(om,S,X,tau,r,q,vT)
    p = real(exp(-1i*log(X)*om) .* cfBSM(om   ,S,tau,r,q,vT) ./ ...
        (1i * om));
    end
    %
    function cf = cfBSM(om,S,tau,r,q,vT)
    cf = exp(1i * om * log(S) + 1i * tau * (r - q) * om - ...
        0.5 * tau * vT * (1i * om + om .^ 2));
    end
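As a cross-check, the price from the characteristic function can be compared with the closed-form BS price in a few lines. This sketch is not the book's implementation: it uses integral (with explicit tolerances) instead of quad, anonymous functions instead of subfunctions, and erfc to express the normal distribution function; the parameter values are illustrative.

```matlab
% European call under BS, priced via the characteristic function and
% via the closed-form formula. Parameter values are illustrative.
S = 100; X = 100; tau = 1; r = 0.02; q = 0.08; v = 0.2^2;

cf = @(om) exp(1i*om*log(S) + 1i*tau*(r - q)*om ...
           - 0.5*tau*v*(1i*om + om.^2));
P1 = @(om) real(exp(-1i*log(X)*om) .* cf(om - 1i) ...
           ./ (1i*om*S*exp((r - q)*tau)));
P2 = @(om) real(exp(-1i*log(X)*om) .* cf(om) ./ (1i*om));

vP1 = 0.5 + 1/pi * integral(P1, 0, 200, 'RelTol', 1e-10, 'AbsTol', 1e-12);
vP2 = 0.5 + 1/pi * integral(P2, 0, 200, 'RelTol', 1e-10, 'AbsTol', 1e-12);
callCF = exp(-q*tau)*S*vP1 - exp(-r*tau)*X*vP2;

% closed-form BS price; N(x) is written with erfc
N  = @(x) 0.5*erfc(-x/sqrt(2));
d1 = (log(S/X) + (r - q + v/2)*tau) / sqrt(v*tau);
d2 = d1 - sqrt(v*tau);
callBS = S*exp(-q*tau)*N(d1) - X*exp(-r*tau)*N(d2);

fprintf('with CF: %6.3f, classic formula: %6.3f\n', callCF, callBS);
```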


Merton

The classic formula can be implemented as follows.

Listing 17.9: C-OptionCalibration/M/./Ch15/callMerton.m

    function call = callMerton(S,X,tau,r,q,v,lambda,muJ,vJ,N)
    % callMerton.m -- version 2010-10-24
    % S      = spot
    % X      = strike
    % tau    = time to mat
    % r      = riskfree rate
    % q      = dividend yield
    % v      = variance (volatility squared)
    % lambda = intensity of poisson process
    % muJ    = mean jump size
    % vJ     = variance of jump process
    % N      = number of jumps to be included in sum
    lambda2 = lambda*(1+muJ); call = 0;
    for n=0:N
        v_n = v + n*vJ/tau;
        r_n = r - lambda*muJ + n*log(1+muJ)/tau;
        call = call + ( exp(-lambda2*tau) * (lambda2*tau)^n ) * ...
            callBSM(S,X,tau,r_n,q,v_n)/ exp( sum(log(1:n)) );
    end

6. For R, see function callCF in the NMOF package. The function values a European call, given a user-supplied
characteristic function. The package already provides several characteristic functions: cfBSM, cfBates, cfHeston, cfMerton,
and cfVG (variance gamma). Put options may be valued through put-call parity.

With the characteristic function:

Listing 17.10: C-OptionCalibration/M/./Ch15/callMertoncf.m

    % callMertoncf.m -- version 2010-11-13
    function call = callMertoncf(S,X,tau,r,q,v,lambda,muJ,vJ)
    % S     = spot
    % X     = strike
    % tau   = time to mat
    % r     = riskfree rate
    % q     = dividend yield
    % v     = variance (volatility squared)
    % lambda= intensity of poisson process
    % muJ   = mean jump size
    % vJ    = variance of jump process
    vP1 = 0.5 + 1/pi * ...
        quad(@P1,0,200,1e-14,[],S,X,tau,r,q,v,lambda,muJ,vJ);
    vP2 = 0.5 + 1/pi * ...
        quad(@P2,0,200,1e-14,[],S,X,tau,r,q,v,lambda,muJ,vJ);
    call = exp(-q * tau) * S * vP1 - exp(-r * tau) * X * vP2;
    end
    %
    function p = P1(om,S,X,tau,r,q,v,lambda,muJ,vJ)
    p = real(exp(-1i*log(X)*om) .* ...
        cfMerton(om-1i,S,tau,r,q,v,lambda,muJ,vJ) ./ ...
        (1i * om * S * exp((r-q) * tau)));
    end
    %
    function p = P2(om,S,X,tau,r,q,v,lambda,muJ,vJ)
    p = real(exp(-1i*log(X)*om) .* ...
        cfMerton(om   ,S,tau,r,q,v,lambda,muJ,vJ) ./ (1i * om));
    end
    %
    function cf = cfMerton(om,S,tau,r,q,v,lambda,muJ,vJ)
    A = 1i*om*log(S) + 1i*om*tau*(r-q-0.5*v-lambda*muJ) - ...
        0.5*(om.^2)*v*tau;
    B = lambda*tau*(exp(1i*om*log(1+muJ)-0.5*1i*om*vJ - ...
        0.5*vJ*om.^2) -1);
    cf = exp(A + B);
    end

Heston 


With MATLAB, we can price call options under the Heston model with the following function.
Listing 17.11: C-OptionCalibration/M/./Ch15/callHestoncf.m

    function call = callHestoncf(S,X,tau,r,q,v0,vT,rho,k,sigma)
    % callHestoncf.m -- version 2010-10-25
    % callHestoncf  Pricing function for European calls
    % callprice = callHestoncf(S,X,tau,r,q,v0,vT,rho,k,sigma)
    % ---
    % S     = spot
    % X     = strike
    % tau   = time to mat
    % r     = riskfree rate
    % q     = dividend yield
    % v0    = initial variance
    % vT    = long run variance (theta in Heston's paper)
    % rho   = correlation
    % k     = speed of mean reversion (kappa in Heston's paper)
    % sigma = vol of vol
    vP1 = 0.5 + 1/pi * ...
        quadl(@P1,0,200,[],[],S,X,tau,r,q,v0,vT,rho,k,sigma);
    vP2 = 0.5 + 1/pi * ...
        quadl(@P2,0,200,[],[],S,X,tau,r,q,v0,vT,rho,k,sigma);
    call = exp(-q * tau) * S * vP1 - exp(-r * tau) * X * vP2;
    end
    %
    function p = P1(om,S,X,tau,r,q,v0,vT,rho,k,sigma)
    i = 1i;
    p = real(exp(-i*log(X)*om) .* ...
        cfHeston(om-i,S,tau,r,q,v0,vT,rho,k,sigma) ./ ...
        (i * om * S * exp((r-q) * tau)));
    end
    %
    function p = P2(om,S,X,tau,r,q,v0,vT,rho,k,sigma)
    i = 1i;
    p = real(exp(-i*log(X)*om) .* ...
        cfHeston(om,S,tau,r,q,v0,vT,rho,k,sigma) ./ (i * om));
    end
    %
    function cf = cfHeston(om,S,tau,r,q,v0,vT,rho,k,sigma)
    d = sqrt((rho * sigma * 1i*om - k).^2 + sigma^2 * ...
        (1i*om + om .^ 2));
    g2 = (k - rho*sigma*1i*om - d) ./ (k - rho*sigma*1i*om + d);
    cf1 = 1i*om .* (log(S) + (r - q) * tau);
    cf2 = vT * k / (sigma^2) * ((k - rho*sigma*1i*om - d) * ...
        tau - 2 * log((1 - g2 .* exp(-d * tau)) ./ (1 - g2)));
    cf3 = v0 / sigma^2 * (k - rho*sigma*1i*om - d) .* ...
        (1 - exp(-d * tau)) ./ (1 - g2 .* exp(-d * tau));
    cf = exp(cf1 + cf2 + cf3);
    end
We can translate this function into R. The main ingredient, the quad function, is replaced by a
call to integrate from the stats package, which is part of any standard installation of R.

    > callHestoncf
    function(S, X, tau, r, q, v0, vT, rho, k, sigma,
             implVol = FALSE) {
        ## S       = spot
        ## X       = strike
        ## tau     = time to mat
        ## r       = riskfree rate
        ## q       = dividend yield
        ## v0      = initial variance
        ## vT      = long run variance (theta in Heston's paper)
        ## rho     = correlation
        ## k       = speed of mean reversion (kappa in Heston's paper)
        ## sigma   = vol of vol
        ## implVol = compute equivalent BSM volatility?

        if (sigma < 0.01)
            sigma <- 0.01
        P1 <- function(om,S,X,tau,r,q,v0,vT,rho,k,sigma) {
            p <- Re(exp(-1i * log(X) * om) *
                    cfHeston(om - 1i, S, tau, r, q, v0, vT, rho, k, sigma) /
                    (1i * om * S * exp((r-q) * tau)))
            p
        }
        P2 <- function(om,S,X,tau,r,q,v0,vT,rho,k,sigma) {
            p <- Re(exp(-1i * log(X) * om) *
                    cfHeston(om  ,S,tau,r,q,v0,vT,rho,k,sigma) /
                    (1i * om))
            p
        }
        cfHeston <- function(om,S,tau,r,q,v0,vT,rho,k,sigma) {
            d <- sqrt((rho * sigma * 1i * om - k)^2 + sigma^2 *
                      (1i * om + om ^ 2))
            g <- (k - rho * sigma * 1i * om - d) /
                 (k - rho * sigma * 1i * om + d)
            cf1 <- 1i * om * (log(S) + (r - q) * tau)
            cf2 <- vT*k/(sigma^2)*((k - rho * sigma * 1i * om - d) *
                   tau - 2 * log((1 - g * exp(-d * tau)) /
                   (1 - g)))
            cf3 <- v0 / sigma^2 * (k - rho * sigma * 1i * om - d) *
                   (1 - exp(-d * tau)) / (1 - g * exp(-d * tau))
            cf  <- exp(cf1 + cf2 + cf3)
            cf
        }

        ## pricing
        vP1 <- 0.5 + 1/pi * integrate(P1, lower = 0, upper = Inf,
                                      S, X, tau, r, q, v0, vT, rho, k, sigma
                                      )$value
        vP2 <- 0.5 + 1/pi * integrate(P2, lower = 0, upper = Inf,
                                      S, X, tau, r, q, v0, vT, rho, k, sigma
                                      )$value
        result <- exp(-q * tau) * S * vP1 - exp(-r * tau) * X * vP2;

        ## implied BSM vol
        if (implVol) {
            diffPrice <- function(vol,call,S,X,tau,r,q) {
                d1 <- (log(S/X)+(r - q + vol^2/2)*tau)/(vol*sqrt(tau))
                d2 <- d1 - vol*sqrt(tau)
                callBSM <- S * exp(-q * tau) * pnorm(d1) -
                           X * exp(-r * tau) * pnorm(d2)
                call - callBSM
            }
            impliedVol <- uniroot(diffPrice, interval = c(0.0001, 2),
                                  call = result, S = S, X = X,
                                  tau = tau, r = r, q = q)[[1L]]
            result <- list(value = result, impliedVol = impliedVol)
        }
        result
    }
    <environment: namespace:NMOF>

Note that the function also has an argument implVol that defaults to FALSE. If set to TRUE,
the function returns a list with the call price and the volatility that would give the same price
with the BS model. Here we have used the function uniroot.
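The same root-finding step can be sketched in MATLAB. The example below is an assumption-laden illustration rather than a translation of the R function: it uses fzero with the bracket [0.0001, 2], writes the normal distribution function with erfc, and recovers a known volatility from a BS price as a round-trip check.

```matlab
% Implied BS volatility by root-finding. fzero and the bracket
% [0.0001, 2] are choices for this sketch; parameter values are
% illustrative.
S = 100; X = 100; tau = 1; r = 0.02; q = 0.08;

Ncdf    = @(x) 0.5*erfc(-x/sqrt(2));
d1      = @(vol) (log(S/X) + (r - q + vol^2/2)*tau) / (vol*sqrt(tau));
callBSM = @(vol) S*exp(-q*tau)*Ncdf(d1(vol)) ...
          - X*exp(-r*tau)*Ncdf(d1(vol) - vol*sqrt(tau));

target     = callBSM(0.2);                 % price to be matched
diffPrice  = @(vol) target - callBSM(vol); % zero at the implied vol
impliedVol = fzero(diffPrice, [0.0001 2]);
fprintf('price %.4f, implied volatility %.6f\n', target, impliedVol);
```

In practice target would be a price computed under another model, for instance Heston; the bracket must then be wide enough for the price difference to change sign.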


Bates
With MATLAB:

Listing 17.12: C-OptionCalibration/M/./Ch15/callBatescf.m

    function call = callBatescf(S,X,tau,r,q,v0,vT,rho,k,sigma,lambda,muJ,vJ)
    % callBatescf.m -- version 2011-01-07
    % S     = spot
    % X     = strike
    % tau   = time to mat
    % r     = riskfree rate
    % q     = dividend yield
    % v0    = initial variance
    % vT    = long run variance (theta in Heston's paper)
    % rho   = correlation
    % k     = speed of mean reversion (kappa in Heston's paper)
    % sigma = vol of vol
    % lambda= intensity of jumps;
    % muJ   = mean of jumps;
    % vJ    = variance of jumps;
    vP1 = 0.5 + 1/pi * quadl(@P1,0,200,[],[],S,X,tau,r,q,v0,vT,...
        rho,k,sigma,lambda,muJ,vJ);
    vP2 = 0.5 + 1/pi * quadl(@P2,0,200,[],[],S,X,tau,r,q,v0,vT,...
        rho,k,sigma,lambda,muJ,vJ);
    call = exp(-q * tau) * S * vP1 - exp(-r * tau) * X * vP2;
    end
    %
    function p = P1(om,S,X,tau,r,q,v0,vT,rho,k,sigma,lambda,muJ,vJ)
    i = 1i;
    p = real(exp(-i*log(X)*om) .* cfBates(om-i,S,tau,r,q,v0,vT,...
        rho,k,sigma,lambda,muJ,vJ) ./ (i * om * S * exp((r-q)*tau)));
    end
    %
    function p = P2(om,S,X,tau,r,q,v0,vT,rho,k,sigma,lambda,muJ,vJ)
    i = 1i;
    p = real(exp(-i*log(X)*om) .* cfBates(om  ,S,tau,r,q,v0,vT,...
        rho,k,sigma,lambda,muJ,vJ) ./ (i * om));
    end
    %
    function cf = cfBates(om,S,tau,r,q,v0,vT,rho,k,sigma,lambda,muJ,vJ)
    d = sqrt((rho * sigma * 1i*om - k).^2 + sigma^2 * ...
        (1i*om + om .^ 2));
    %
    g2 = (k - rho*sigma*1i*om - d) ./ (k - rho*sigma*1i*om + d);
    %
    cf1 = 1i*om .* (log(S) + (r - q) * tau);
    cf2 = vT * k / (sigma^2) * ((k - rho*sigma*1i*om - d) * ...
        tau - 2 * log((1 - g2 .* exp(-d * tau)) ./ (1 - g2)));
    cf3 = v0 / sigma^2 * (k - rho*sigma*1i*om - d) .* ...
        (1 - exp(-d * tau)) ./ (1 - g2 .* exp(-d * tau));
    % jump
    cf4 = -lambda * muJ * 1i * tau * om + lambda*tau*...
        ((1+muJ).^(1i*om) .* exp( vJ*(1i*om/2) .* (1i*om-1) )-1);
    cf = exp(cf1 + cf2 + cf3 + cf4);
    end


The following example shows how to call the functions. We have wrapped all function calls in
tic, toc calls to give an idea of the time required to compute a price. For an actual performance
comparison, we should always run the functions many times and then look at the average elapsed
time.
Listing 17.13: C-OptionCalibration/M/./Ch15/example.m

    % example.m -- version 2010-10-24
    %% example BSM
    S   = 100;    % spot price
    q   = 0.08;   % dividend yield (eg, 0.03)
    r   = 0.02;   % interest rate (eg, 0.03)
    X   = 100;    % strike
    tau = 1;      % time to maturity
    v   = 0.2^2;  % variance
    tic, call = callBSMcf(S,X,tau,r,q,v); t=toc;
    fprintf('BSM\nwith CF: %6.3f, required time: %4.3f seconds\n',call,t)

    tic, call = callBSM(S,X,tau,r,q,v);t=toc;
    fprintf('classic formula: %6.3f, required time: %4.3f seconds\n---\n\n',call,t)

    %% example Heston
    S     = 100;
    q     = 0.08;
    r     = 0.02;
    X     = 100;
    tau   = 1;
    k     = 1.0;      % mean reversion speed (kappa in paper)
    sigma = 0.00001;  % vol of vol
    rho   = -0.7;     % correlation
    v0    = 0.2^2;    % current variances
    vT    = 0.2^2;    % long-run variance (theta in paper)
    tic, call = callHestoncf(S,X,tau,r,q,v0,vT,rho,k,sigma); t=toc;
    fprintf('Heston\nwith CF: %6.3f, required time: %4.3f seconds\n---\n\n',call,t)

    %% example Bates
    S      = 100;
    q      = 0.08;
    r      = 0.02;
    X      = 100;
    tau    = 1;
    k      = 1.0;        % mean reversion speed (kappa in paper)
    sigma  = 0.00001;    % vol of vol
    rho    = -0.3;       % correlation
    v0     = 0.2^2;      % current variances
    vT     = 0.2^2;      % long-run variance (theta in paper)
    lambda = 0.0;        % intensity of jumps;
    muJ    = -0.0;       % mean of jumps;
    vJ     = 0.00001^2;  % variance of jumps;
    tic, call = callBatescf(S,X,tau,r,q,v0,vT,...
        rho,k,sigma,lambda,muJ,vJ); t=toc;
    fprintf('Bates\nwith CF: %6.3f, required time: %4.3f seconds\n---\n\n',call,t)

    %% example Merton jump--diffusion
    S      = 100;
    q      = 0.08;
    r      = 0.02;
    X      = 100;
    tau    = 1;
    v      = 0.2^2;  % variance (volatility squared)
    lambda = 0.2;    % intensity of jumps;
    muJ    = -0.1;   % mean of jumps;
    vJ     = 0.2^2;  % variance of jumps;
    N      = 20;     % number of jumps for classic formula
    tic, call = callMertoncf(S,X,tau,r,q,v,lambda,muJ,vJ); t = toc;
    fprintf('Merton jump-diffusion\nwith CF: %6.3f, required time: %4.3f seconds\n',call,t)
    tic, call = callMerton(S,X,tau,r,q,v,lambda,muJ,vJ,N); t = toc;
    fprintf('classic formula: %6.3f, required time: %4.3f seconds\n',call,t)


Running this code should result in output similar to the following.

    BSM
    with CF: 5.064, required time: 0.004 seconds
    classic formula: 5.064, required time: 0.000 seconds
    Heston
    with CF: 5.064, required time: 0.008 seconds
    Bates
    with CF: 5.064, required time: 0.008 seconds
    Merton jump-diffusion
    with CF: 5.702, required time: 0.004 seconds
    classic formula: 5.702, required time: 0.002 seconds
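Single tic/toc measurements like these are noisy. Averaging over many runs, as recommended above, might look as follows; the function being timed is only a stand-in, and the number of runs is an arbitrary choice.

```matlab
% Average elapsed time over many calls. The function being timed is a
% stand-in (a simple quadrature); nRuns is an arbitrary choice.
f     = @() integral(@(x) exp(-x), 0, 5);
nRuns = 200;

f();                         % warm-up call, not timed
tic
for iRun = 1:nRuns
    f();
end
avgTime = toc / nRuns;
fprintf('average time per call: %g seconds\n', avgTime);
```

To time one of the pricing functions instead, f would be replaced by a handle such as @() callBSMcf(S,X,tau,r,q,v), assuming that function is on the path.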


The smile, again: comparing the models 


To get some intuition about these models, we now (i) choose a model and fix parameter values,
(ii) price a matrix of options (different strikes, different maturities) under this model, and (iii) compute
the implied volatilities under BS. Figs. 17.2-17.4 show examples. On the left, we always
plot the implied volatilities for a given time to maturity (one month and three years) but different
strikes. On the right, we plot the term structure of implied volatility, that is, the implied volatility
of at-the-money options for times to maturity up to three years.

[Figure 17.2: four rows of panels (A) to (D). Left panels: implied volatility against strike (80 to 120) for maturities of 1 month and 3 years. Right panels: implied volatility against time to maturity (0 to 3 years).]
(A) The base case: S = 100, r = 2%, q = 2%, $\sqrt{v_0}$ = 30%, $\sqrt{\theta}$ = 30%, $\rho$ = 0, $\kappa$ = 1, $\sigma$ = 30%.
(B) $\sigma$ = 90%: short-term smile (the position of the kink is controlled by $\rho$); often we need
substantial volatility-of-volatility to induce a smile.
(C) $\rho$ = −0.5: skew (a positive correlation induces positive slope).
(D) $\sqrt{v_0}$ = 35%, $\sqrt{\theta}$ = 25%: term structure is determined by the difference between current and
long-run variance, and $\kappa$.
FIGURE 17.2 Heston model: re-creating the implied volatility surface. The graphics show the BS-implied volatilities
obtained from prices under the Heston model. The panels on the right show the implied volatility of ATM options.

[Figure 17.3: four rows of panels (A) to (D), with the same axes as Figure 17.2.]
(A) The base case: S = 100, r = 2%, q = 2%, $\sqrt{v_0}$ = 30%, $\sqrt{\theta}$ = 30%, $\rho$ = 0, $\kappa$ = 1,
$\sigma$ = 0.0%, $\lambda$ = 0.1, $\mu_J$ = 0, $v_J$ = 30%. Volatility-of-volatility is zero, as is the jump mean.
(B) $\mu_J$ = −10%: more asymmetry.
(C) $\sigma$ = 90%: stochastic volatility included.
(D) $\mu_J$ = −10%, $\sigma$ = 70%, $\rho$ = −0.3.
FIGURE 17.3 Bates model: re-creating the implied volatility surface. The graphics show the BS-implied volatilities
obtained from prices under the Bates model. The panels on the right show the implied volatility of ATM options.

The Bates model nests BS, Heston and Merton; therefore, the implied volatilities of those models
can all be reproduced with this model.


A primer on numerical integration 


MATLAB’s quad is reliable but slow. The pricing can be accelerated by precomputing a fixed 
number of nodes and weights under the given quadrature scheme. We start with a brief overview 
of how numerical integration works. For a textbook exposition, see, for instance, Heath (2005, 
Chapter 8). Davis and Rabinowitz (2007) is a comprehensive reference. A highly recommended 
paper is Trefethen (2008a); it is one of the rare occasions where actual convergence (as opposed
to theoretical optimality) is discussed for specific rules.
[Figure 17.4: two rows of panels (A) and (B). Left panels: implied volatility against strike (80 to 120) for maturities of 1 month and 3 years. Right panels: implied volatility against time to maturity (0 to 3 years).]
(A) S = 100, r = 2%, q = 2%, $\sqrt{v}$ = 30%, $\lambda$ = 0.01, $\mu_J$ = −50%, $v_J$ = 30%.
(B) S = 100, r = 2%, q = 2%, $\sqrt{v}$ = 30%, $\lambda$ = 0.30, $\mu_J$ = −2%, $v_J$ = 30%.
FIGURE 17.4 Merton model: re-creating the implied volatility surface. The graphics show the BS-implied volatilities
obtained from prices under the Merton model. The panels on the right show the implied volatility of ATM options. Importantly,
jumps need to be volatile to get a smile (i.e., $v_J$ > 30%).

The essence of numerical integration, or quadrature, is to replace an integral

$$ \int_a^b f(x)\,dx $$

by the sum

$$ \sum_{i=1}^{n} w_i f(x_i). \tag{17.17} $$


The $x_i$ are called the nodes or abscissas, the $w_i$ are weights. We either assume there are n nodes, or
that the interval [a, b] is subdivided into m partitions. Quadrature rules detail how to choose these
nodes and weights. A rule is called closed if it requires evaluating the endpoints a and b; otherwise,
the rule is called open.

An intuitive approach is to follow Riemann's original idea and replace the integral by
the sum of the area of m rectangles. Such a Riemann sum is defined as follows. Assume

$$ a = x_1 < x_2 < \cdots < x_m < x_{m+1} = b, $$

then any collection of nodes $\nu_k \in [x_k, x_{k+1}]$, for k = 1, ..., m, defines a Riemann sum

$$ \sum_{k=1}^{m} (x_{k+1} - x_k)\, f(\nu_k). \tag{17.18} $$

We define h = (b − a)/m; then some possible quadrature rules based on Riemann sums are:

Rectangular rule  $h\sum_{k=0}^{m-1} f(a + kh)$ (evaluation on the left side), or
$h\sum_{k=1}^{m} f(a + kh)$ (evaluation on the right side).

Midpoint rule  We evaluate the rectangle in the middle, hence
$h\sum_{k=0}^{m-1} f\left(a + \left[k + \tfrac{1}{2}\right]h\right)$.

Trapezoidal rule  $h\left(\sum_{k=1}^{m-1} f(a + kh) + \frac{f(a) + f(b)}{2}\right)$ for m ≥ 2; or
$h\,\frac{f(a) + f(b)}{2}$ for m = 1.

The following code shows how to implement such rules with MATLAB.⁷ We include a call to
MATLAB's quad function as a comparison.

Listing 17.14: C-OptionCalibration/M/./Ch15/exampleQuad1.m

    % exampleQuad1.m -- version 2010-10-24
    %% Riemann sums - example
    Fun1 = @(x) (exp(-x));
    m = 5; a = 0; b = 5; h = (b-a)/m;

    % rectangular rule -- left
    w = h; k = 0:(m-1); x = a + k * h;
    fprintf('rectangular (left) with %i rectangles:\t %f\n',m,sum(w * Fun1(x)))

    % rectangular rule -- right
    w = h; k = 1:m; x = a + k * h;
    fprintf('rectangular (right) with %i rectangles:\t %f\n',m,sum(w * Fun1(x)))

    % midpoint rule
    w = h; k = 0:(m-1); x = a + (k + 0.5)*h;
    fprintf('midpoint with %i rectangles:\t %f\n',m,sum(w * Fun1(x)))

    % trapezoidal rule
    w = h; k = 1:(m-1); x = [a a+k*h b];
    aux = w * Fun1(x); aux([1 end]) = aux([1 end])/2;
    fprintf('trapezoidal with %i rectangles:\t %f\n',m,sum(aux))

    % adaptive Simpson
    fprintf('Adaptive Simpson (Matlab):\t\t\t\t %f\n',quad(Fun1,a,b))

    rectangular (left) with 5 rectangles:    1.571317
    rectangular (right) with 5 rectangles:   0.578055
    midpoint with 5 rectangles:              0.953052
    trapezoidal with 5 rectangles:           1.074686
    Adaptive Simpson (Matlab):               0.993262
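With only five subintervals the Riemann-type rules are far from the true value. A short experiment shows how the error shrinks as m grows; the sketch below uses the same integrand, and the values of m are arbitrary choices.

```matlab
% Errors of the midpoint and trapezoidal rules for exp(-x) on [0, 5]
% as the number of subintervals m doubles. The values of m are arbitrary.
f = @(x) exp(-x);
a = 0; b = 5; exact = 1 - exp(-5);
ms = [10 20 40 80];
errMid  = zeros(size(ms));
errTrap = zeros(size(ms));
for j = 1:numel(ms)
    m = ms(j); h = (b - a)/m;
    xMid  = a + ((0:m-1) + 0.5)*h;           % midpoints of subintervals
    xTrap = a + (0:m)*h;                     % all m+1 grid points
    fTrap = f(xTrap); fTrap([1 end]) = fTrap([1 end])/2;
    errMid(j)  = abs(h*sum(f(xMid)) - exact);
    errTrap(j) = abs(h*sum(fTrap) - exact);
end
ratioMid  = errMid(1:end-1)  ./ errMid(2:end);
ratioTrap = errTrap(1:end-1) ./ errTrap(2:end);
disp([ms' errMid' errTrap']);
```

Each doubling of m cuts both errors by a factor of roughly four, which is the second-order behavior one expects from these rules for a smooth integrand.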


Interpolatory rules 


The three rules stated above partition the interval [a, b] into equal-sized subintervals and approxi- 
mate the integrand in each subinterval by a rectangle or a trapezoid. The accuracy of this approach 
improves as the number of subintervals increases. But rectangles or trapezoids may not be natu- 
ral candidates to approximate a function; if the function is smooth, we can do better. Given n + 1 
nodes, we can fit a polynomial of order n that interpolates the function values at these nodes. This 
polynomial can then be integrated exactly as an approximation of the true integral. We do not actually
have to fit and integrate a polynomial in each case; it turns out that the approach is equivalent
to setting the weights w so that the monomials $x^0, x^1, \ldots, x^n$ are integrated exactly; see Davis and
Rabinowitz (2007, Chapter 2) for a proof. For equidistant nodes, the resulting quadrature schemes
are called Newton-Cotes rules.

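Assume we wish to determine a k-point Newton-Cotes rule: we fix $x_1, x_2, \ldots, x_k$, then choose
the $w_1, w_2, \ldots, w_k$ so that the resulting rule integrates the polynomials $x^0 = 1, x^1, \ldots, x^{k-1}$ exactly
on the interval [a, b]. We obtain

$$
\begin{aligned}
w_1 x_1^0 + w_2 x_2^0 + w_3 x_3^0 + \cdots + w_k x_k^0 &= \int_a^b 1\,dx = b - a \\
w_1 x_1 + w_2 x_2 + w_3 x_3 + \cdots + w_k x_k &= \int_a^b x\,dx = \tfrac{1}{2}\left(b^2 - a^2\right) \\
w_1 x_1^2 + w_2 x_2^2 + w_3 x_3^2 + \cdots + w_k x_k^2 &= \int_a^b x^2\,dx = \tfrac{1}{3}\left(b^3 - a^3\right) \\
&\;\;\vdots \\
w_1 x_1^{k-1} + w_2 x_2^{k-1} + w_3 x_3^{k-1} + \cdots + w_k x_k^{k-1} &= \int_a^b x^{k-1}\,dx = \tfrac{1}{k}\left(b^k - a^k\right).
\end{aligned}
\tag{17.19}
$$

This can be rewritten conveniently as

$$
\begin{bmatrix}
1 & 1 & 1 & \cdots & 1 \\
x_1 & x_2 & x_3 & \cdots & x_k \\
\vdots & \vdots & \vdots & & \vdots \\
x_1^{k-1} & x_2^{k-1} & x_3^{k-1} & \cdots & x_k^{k-1}
\end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_k \end{bmatrix}
=
\begin{bmatrix} b - a \\ \tfrac{1}{2}\left(b^2 - a^2\right) \\ \vdots \\ \tfrac{1}{k}\left(b^k - a^k\right) \end{bmatrix}
\tag{17.20}
$$

and then solved for w. (Compare this with the rules based on Riemann sums: for equidistant nodes,
all function values were equally weighted there.)
With h = (b − a)/m = (b − a)/(n − 1), we have the following closed rules:

    n   Name                  Nodes x                    Weights w
    2   Trapezoidal rule      a, b                       1/2 h, 1/2 h
    3   Simpson's rule        a, a+h, b                  1/3 h, 4/3 h, 1/3 h
    4   Simpson's 3/8-rule    a, a+h, a+2h, b            3/8 h, 9/8 h, 9/8 h, 3/8 h
    5   Boole's rule          a, a+h, a+2h, a+3h, b      14/45 h, 64/45 h, 24/45 h, 64/45 h, 14/45 h

7. For an R implementation, see the Examples section of function xwGauss in the NMOF package.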

Newton-Cotes rules of very high order are rarely used, though, since convergence is not guaranteed
for $n \to \infty$, and the linear systems (see Eq. (17.20)) become ever more badly conditioned as n increases. Instead,
the interval of integration is subdivided into smaller subintervals, and to each a low-order rule is
applied. Such an implementation is called a composite (or compound) rule.

The reasoning of Eq. (17.19) can be taken one step further by also freely choosing the $x_i$. This
will leave us 2n variables, the w and the x. Choosing them such that they integrate $x^0, x^1, x^2, \ldots, x^{2n-1}$
exactly leads to Gauss rules. In principle, we could use the approach of Eq. (17.19), but this
leads to nonlinear equations that are much harder to solve. Fortunately, nodes and weights can also
be computed in alternative ways; for instance, as the zeros of certain polynomials. We outline next
how to compute the nodes and weights as suggested by Golub and Welsch (1969).


Finding the nodes for Gauss rules 


Let $\varphi_0, \varphi_1, \varphi_2, \ldots, \varphi_n$ be a sequence of orthogonal polynomials, that is,

$$ \int_a^b \varphi_i(x)\,\varphi_j(x)\,w(x)\,dx = 0, \quad \text{for all } i \neq j. $$

The subscripts of the $\varphi$ indicate the order of the polynomial. Orthogonality holds with respect to a
weight function w, and for an interval [a, b]. The zeros of $\varphi_n(x)$, that is, the x that satisfy $\varphi_n(x) = 0$,
are the nodes of an n-point Gauss rule for the interval [a, b]. If the sequence is also normalized, so
that we have

$$ \int_a^b \varphi_i(x)\,\varphi_i(x)\,w(x)\,dx = 1, \quad \text{for all } i, $$

we call the polynomials orthonormal.
For orthogonal polynomials, the following three-term recurrence relation holds:

$$\varphi_n(x) = (\alpha_n x + \beta_n)\,\varphi_{n-1}(x) - \gamma_n\,\varphi_{n-2}(x), \qquad n \ge 1, \tag{17.21}$$

and $\varphi_{-1} = 0$ and $\varphi_0 = 1$. The numbers $\alpha_n$, $\beta_n$, and $\gamma_n$ are functions of the coefficients of the
polynomials. We can rearrange (17.21) into

$$x\,\varphi_{n-1}(x) = \frac{1}{\alpha_n}\,\varphi_n(x) + \underbrace{\left(-\frac{\beta_n}{\alpha_n}\right)}_{\delta_n} \varphi_{n-1}(x) + \frac{\gamma_n}{\alpha_n}\,\varphi_{n-2}(x). \tag{17.22}$$

For normalized polynomials, $\gamma_n$ is equal to $\alpha_n/\alpha_{n-1}$, so the relation becomes even simpler. In such
a case, the coefficient of $\varphi_{n-2}(x)$ becomes $1/\alpha_{n-1}$. (This is a way to check if the polynomials are
normalized: reduce n in the coefficient of $\varphi_n(x)$ by 1, then the coefficient of $\varphi_{n-2}(x)$ must result.)


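We can put Eq. (17.22) into matrix notation (Wilf, 1978), with $\varphi_n(x) = 0$ for $n < 0$ and $\delta_i = -\beta_i/\alpha_i$:

$$x \underbrace{\begin{bmatrix} \varphi_0(x) \\ \varphi_1(x) \\ \varphi_2(x) \\ \vdots \\ \varphi_{n-1}(x) \end{bmatrix}}_{\Phi(x)} = \underbrace{\begin{bmatrix} \delta_1 & \frac{1}{\alpha_1} & & & \\ \frac{\gamma_2}{\alpha_2} & \delta_2 & \frac{1}{\alpha_2} & & \\ & \frac{\gamma_3}{\alpha_3} & \delta_3 & \ddots & \\ & & \ddots & \ddots & \frac{1}{\alpha_{n-1}} \\ & & & \frac{\gamma_n}{\alpha_n} & \delta_n \end{bmatrix}}_{A} \begin{bmatrix} \varphi_0(x) \\ \varphi_1(x) \\ \varphi_2(x) \\ \vdots \\ \varphi_{n-1}(x) \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ \frac{1}{\alpha_n}\varphi_n(x) \end{bmatrix}. \tag{17.23}$$

Now assume we insert $x^*$ into Eq. (17.23), and $x^*$ is a zero of $\varphi_n$. Then the last term in (17.23)
vanishes, and we are left with

$$x^*\,\Phi(x^*) = A\,\Phi(x^*).$$

This equation can only hold if $x^*$ is an eigenvalue of A (note that $\Phi(x^*) \neq 0$ because $\varphi_0 = 1$); hence the zeros of $\varphi_n$, and thus the nodes of
an n-point Gauss rule, are the eigenvalues of A. This matrix can be made to have even more structure:
if the polynomials are normalized, A will be symmetric. If A is not symmetric, we can perform
a diagonal similarity transformation (i.e., make A symmetric while not changing its eigenvalues).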


For this, we define

$$\eta_n = \sqrt{\frac{\gamma_{n+1}}{\alpha_n\,\alpha_{n+1}}}$$

and replace the matrix A by

$$B = \begin{bmatrix} \delta_1 & \eta_1 & & & \\ \eta_1 & \delta_2 & \eta_2 & & \\ & \eta_2 & \delta_3 & \ddots & \\ & & \ddots & \ddots & \eta_{n-1} \\ & & & \eta_{n-1} & \delta_n \end{bmatrix}.$$

The matrix B has the same eigenvalues as A, that is, it returns the same Gauss nodes. But since it
is symmetric, there are more-efficient algorithms to compute the eigenvalues. MATLAB's eig will
exploit symmetry but not the fact that B is tridiagonal. Since the eigenvalues of a symmetric matrix
are all real, we also have a proof that the Gauss nodes are real.

Having computed the nodes, we could compute weights with Eqs. (17.19) and (17.20); but the
weights can also be obtained from the eigenvectors of B. More specifically, the weight corresponding
to an eigenvalue/node is given by

$$\xi^2 \int_a^b w(x)\,dx$$

where $\xi$ is the first element of the eigenvector (normalized to unit length) that belongs to the particular eigenvalue.

For many polynomials, the $\alpha$, $\beta$, and $\gamma$ from Eq. (17.21) are known, and hence A (and its
symmetric counterpart) can be set up. We will later on use the Legendre polynomials $\varphi^L_k$. They
have weight function $w(x) = 1$ and are defined on the interval $[-1, 1]$. Abramowitz and Stegun
(1965) tell us that

$$(n+1)\,\varphi^L_{n+1} = (2n+1)\,x\,\varphi^L_n - n\,\varphi^L_{n-1},$$

or

$$\varphi^L_n = \frac{2n-1}{n}\,x\,\varphi^L_{n-1} - \frac{n-1}{n}\,\varphi^L_{n-2},$$

with $\beta_n = 0$. Note that these polynomials are not normalized. So we set up A and use the above
expressions, arriving at $\delta_n = 0$, and

$$\eta_n = \frac{1}{\sqrt{4 - n^{-2}}}.$$

With MATLAB:⁸


Listing 17.15: C-OptionCalibration/M/./Ch15/GLnodesweights.m

function [x,w] = GLnodesweights(n)
% GLnodesweights -- version 2010-10-24
% (G_auss L_egendre...)
eta = 1 ./ sqrt(4-(1:(n-1)).^(-2));
A = diag(eta,1) + diag(eta,-1);
[V,D] = eig(A);
x = diag(D);
% Matlab does not guarantee sorted eigenvalues
[x,i] = sort(x);
% weights: for Legendre w(x)=1; integral from -1 to 1 = 2
w = 2 * V(1,i) .^ 2;
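As a cross-check of the derivation, the following sketch builds both the nonsymmetric matrix A and its symmetric counterpart B for the Legendre polynomials and compares their eigenvalues. It assumes the recurrence coefficients $\alpha_j = (2j-1)/j$, $\beta_j = 0$, and $\gamma_j = (j-1)/j$, which follow from writing the Legendre recurrence in the form of Eq. (17.21); the variable names are chosen for this illustration only.

```matlab
% Sketch: nonsymmetric A versus symmetric B for the Legendre recurrence
n = 6;                          % number of nodes (arbitrary for this check)
j = 1:n;
alpha = (2*j - 1) ./ j;         % alpha_j as in Eq. (17.21)
gamma = (j - 1) ./ j;           % gamma_j; beta_j = 0, hence delta_j = 0

% A: superdiagonal 1/alpha_i, subdiagonal gamma_i/alpha_i (i = 2,...,n)
A = diag(1 ./ alpha(1:n-1), 1) + diag(gamma(2:n) ./ alpha(2:n), -1);

% B: symmetric, off-diagonals eta_i = sqrt(gamma_{i+1}/(alpha_i*alpha_{i+1}))
eta = sqrt(gamma(2:n) ./ (alpha(1:n-1) .* alpha(2:n)));
B = diag(eta, 1) + diag(eta, -1);

xA = sort(real(eig(A)));        % eigenvalues of A (real up to rounding)
xB = sort(eig(B));              % eigenvalues of B, i.e., the Gauss nodes

% closed form used in GLnodesweights
etaClosed = 1 ./ sqrt(4 - (1:(n-1)).^(-2));

fprintf('max difference in eigenvalues: %g\n', max(abs(xA - xB)));
fprintf('max difference in eta:         %g\n', max(abs(eta - etaClosed)));
```

Both differences should be at the level of rounding error, which confirms that the similarity transformation leaves the nodes unchanged.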


A Gauss rule for an interval $[a_0, b_0]$ can be transferred to an interval $[a, b]$ as follows (Heath, 2005,
pp. 352-353):

$$\tilde{x} = \frac{(b - a)\,x + a\,b_0 - b\,a_0}{b_0 - a_0}, \qquad \tilde{w} = \frac{b - a}{b_0 - a_0}\,w,$$

where $x$ and $w$ are the nodes and weights for $[a_0, b_0]$.

8. For an R implementation, see the function xwGauss in the NMOF package.

It is convenient to put this transformation into a function.


Listing 17.16: C-OptionCalibration/M/./Ch15/changeInterval.m

% changeInterval.m -- version 2010-10-24
function [x,w] = changeInterval(x, w, aFrom, bFrom, aTo, bTo)
a0 = aFrom; b0 = bFrom; % interval for Gauss rule
a = aTo; b = bTo;
x = ((b-a)*x + a*b0-b*a0)/(b0-a0); % new nodes
w = w * (b-a)/(b0-a0); % new weights

Listing 17.17: C-OptionCalibration/M/./Ch15/exampleQuad2.m

% exampleQuad2.m -- version 2010-10-24
% continues exampleQuad1.m

% compute nodes/weights
[x,w] = GLnodesweights(m);
% change interval of integration
[x,w] = changeInterval(x, w, -1, 1, 0, 5);
fprintf('Gauss-Legendre:\t %f\n', w * Fun1(x))

Gauss-Legendre: 0.993260
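The claim that an n-point Gauss rule integrates polynomials up to order 2n - 1 exactly can be checked numerically. The sketch below is self-contained (it repeats the node computation rather than calling GLnodesweights), uses n = 5 as an arbitrary choice, and assumes the integrand $e^{-x}$ on [0, 5] as a test function.

```matlab
% Sketch: exactness of an n-point Gauss-Legendre rule for monomials
n = 5;
eta = 1 ./ sqrt(4 - (1:(n-1)).^(-2));
[V,D] = eig(diag(eta,1) + diag(eta,-1));
[x,idx] = sort(diag(D));        % nodes on [-1,1] (column vector)
w = 2 * V(1,idx).^2;            % weights (row vector)

% integrate x^k over [-1,1] for k = 0,...,2n
err = zeros(1, 2*n + 1);
for k = 0:(2*n)
    exactK = (1 - (-1)^(k+1)) / (k+1);   % 2/(k+1) for even k, 0 for odd k
    err(k+1) = abs(w * x.^k - exactK);
end
fprintf('largest error, orders 0 to %d: %g\n', 2*n - 1, max(err(1:2*n)));
fprintf('error for order %d:            %g\n', 2*n, err(2*n + 1));

% transfer the rule from [-1,1] to [0,5] and integrate exp(-x)
a = 0; b = 5;
xt = ((b - a)*x + a + b) / 2;   % new nodes
wt = w * (b - a) / 2;           % new weights
approx = wt * exp(-xt);
exact = 1 - exp(-5);
fprintf('exp(-x) on [0,5]: %f (exact %f)\n', approx, exact);
```

The errors for orders 0 to 2n - 1 should be at rounding level, while the error for order 2n is clearly visible.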

Example 17.3 


The distribution function of a Gaussian random variable is given by

$$N(b) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{b} e^{-u^2/2}\,du.$$

This integral cannot be computed analytically. We try a Gauss rule. But how do we integrate from $-\infty$?
If a non-periodic function is to have a finite integral over an infinite range, it must be essentially zero
over most of that range. For the integrand here, the Gaussian density, we know that it is essentially zero
when (let us be conservative) its argument is smaller than -10 or greater than 10. So:


Listing 17.18: C-OptionCalibration/M/./Ch15/integrateGauss.m

% integrateGauss.m -- version 2010-10-24
% (also tested with Octave 4.2.2)

% goal: to compute N(b)
b = 2;

% number of nodes
n = 256;

% replace minus infinity by
lowerLim = -10;

% compute nodes/weights
[x,w] = GLnodesweights(n);

% change interval of integration
[x,w] = changeInterval(x, w, -1, 1, lowerLim, b);

% result of integration
ourResult = w*GaussF(x)

% result of normcdf from Statistics Toolbox
MatlabResult = normcdf(b)

abs(ourResult-MatlabResult)

The absolute error against MATLAB's normcdf is

ans =
   5.9397e-014
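The effect of replacing $-\infty$ by -10 can be quantified without the Statistics Toolbox, because $N(b) = \tfrac{1}{2}\,\mathrm{erfc}(-b/\sqrt{2})$. The following self-contained sketch assumes n = 50 nodes and b = 2 (both arbitrary choices for illustration) and reports the integration error together with the probability mass that is cut off below the lower limit.

```matlab
% Sketch: N(b) by Gauss-Legendre on [lowerLim, b], checked against erfc
n = 50;
eta = 1 ./ sqrt(4 - (1:(n-1)).^(-2));
[V,D] = eig(diag(eta,1) + diag(eta,-1));
[x,idx] = sort(diag(D));                 % nodes on [-1,1]
w = 2 * V(1,idx).^2;                     % weights on [-1,1]

b = 2; lowerLim = -10;
xt = ((b - lowerLim)*x + lowerLim + b) / 2;   % nodes on [lowerLim, b]
wt = w * (b - lowerLim) / 2;                  % weights on [lowerLim, b]

dens = @(u) exp(-0.5*u.^2) / sqrt(2*pi);      % Gaussian density
ourResult = wt * dens(xt);

reference = 0.5 * erfc(-b/sqrt(2));           % N(b)
tailMass  = 0.5 * erfc(-lowerLim/sqrt(2));    % N(lowerLim), the part ignored

fprintf('integration error: %g\n', abs(ourResult - reference));
fprintf('mass below %g:     %g\n', lowerLim, tailMass);
```

The ignored tail mass is many orders of magnitude below machine precision relative to N(b), so the truncation at -10 is indeed conservative.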


A remark: integration rules like Gauss-Legendre (or others, e.g., Clenshaw-Curtis) prescribe
sampling the integrand at points that cluster around the endpoints of the interval. This happens because
a Gauss rule essentially approximates the function to be integrated by a polynomial, and then
integrates this polynomial exactly. Gauss rules are even optimal in the sense that for a given number
of nodes, they integrate exactly a polynomial of the highest order possible. For an oscillating function,
however, we may need a very-high-order polynomial to obtain a good approximation, and,
therefore, alternative rules may be more efficient for such functions (Hale and Trefethen, 2008).

Suppose we want to approximate the following functions by polynomials (the third function is
taken from Gander and Gautschi, 2000).

(i) $x = e^{-t}$ for $t \in [0, 10]$
(ii) $x = \sin(t)$ for $t \in [0, 10\pi]$
(iii) $x = \begin{cases} t + 1 & \text{if } t < 1 \\ 3 - t & \text{if } 1 \le t \le 3 \\ 2 & \text{if } t > 3 \end{cases}$

The following figures show the function (in gray) and the approximation by a polynomial of order
5 (in dashed black). We can barely distinguish function (i) from its polynomial approximation,
but for functions (ii) and (iii) we need very-high-order polynomials, even though the functions
are not necessarily difficult, in particular, function (iii).


Here is the code to fit the polynomials: 


Listing 17.19: C-OptionCalibration/M/./Ch15/exPolynomial.m

% nice example
Fun1 = @(x) (exp(-x));
a = 0; b = 10; points = 200;
t = linspace(a,b,points);
x = Fun1(t);
plot(t,x); ylim([-0.5 1]); grid on, hold on
%
p = polyfit(t,x,5); f = polyval(p,t);
plot(t,f,'k--')

Listing 17.20: C-OptionCalibration/M/./Ch15/exPolynomial.m

% not so nice example 1
a = 0; b = 10*pi; points = 200;
t = linspace(a,b,points);
x = sin(t);
plot(t,x); ylim([-1.5 1.5]); grid on, hold on
%
p = polyfit(t,x,5); f = polyval(p,t);
plot(t,f,'k--');

Listing 17.21: C-OptionCalibration/M/./Ch15/exPolynomial.m

% not so nice example 2
a = -5; b = 5; points = 200;
t = linspace(a,b,points);
x = testFun(t)';
plot(t,x); ylim([-4 4]); grid on; hold on
%
p = polyfit(t,x,5); f = polyval(p,t);
plot(t,f,'k--');


Note that we have used the MATLAB function polyfit to find the coefficients of the polynomial.
We could have computed them ourselves with the following regression.

Listing 17.22: C-OptionCalibration/M/./Ch15/exPolynomial.m

% ... fit polynomial
T = [t'.^0 t'.^1 t'.^2 t'.^3 t'.^4 t'.^5]; % Vandermonde matrix
p = T \ x';      % coefficients in ascending powers (polyfit: descending)
f = (T * p)';    % fitted values

MATLAB will do some rescaling of the Vandermonde matrix $[t^0\ t^1\ t^2\ \ldots]$. The function testFun
is given by

Listing 17.23: C-OptionCalibration/M/./Ch15/testFun.m

% testFun.m -- version 2010-10-30
function z = testFun(x)
ind = NaN(length(x),1);
ind(x<=3 & x>=1) = 0; ind(x<1) = -1; ind(x>3) = 1;
%
z(ind==0) = 3-x(ind==0);
z(ind==-1) = x(ind==-1)+1;
z(ind==1) = 2;
z = z';
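To see how the required polynomial order grows for the oscillating function (ii), the regression can be repeated for several orders. The sketch below is an illustration under stated assumptions: it rescales t to [-1, 1] before building the Vandermonde matrix (to keep the least-squares problem reasonably conditioned), and the orders 5, 15, and 25 are arbitrary choices.

```matlab
% Sketch: polynomial least-squares fits of sin(t) on [0, 10*pi]
t = linspace(0, 10*pi, 200)';
x = sin(t);
s = 2*(t - t(1)) / (t(end) - t(1)) - 1;   % rescale t to [-1,1]

orders = [5 15 25];
maxErr = zeros(size(orders));
for j = 1:numel(orders)
    T = bsxfun(@power, s, 0:orders(j));   % Vandermonde matrix [s.^0 s.^1 ...]
    p = T \ x;                            % least-squares coefficients
    maxErr(j) = max(abs(T*p - x));        % largest deviation on the grid
end
for j = 1:numel(orders)
    fprintf('order %2d: max error %g\n', orders(j), maxErr(j));
end
```

With order 5 the fit misses the oscillations almost entirely; only with a much higher order does the polynomial follow the sine curve.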


Pricing tests 


In this section and the final part of this chapter, we present some results from Gilli and Schumann
(2011a). There we used a Gauss-Legendre rule. We experimented with alternatives like
Gauss-Lobatto as well, but no integration scheme was clearly dominant over another, given the required
precision of our problem (there is no need to compute option prices to eight decimals).

To test the pricing algorithms, we first investigate the BS model and Merton's jump-diffusion
model. For these models, we can compare the solutions obtained from the classic formulas with
those from integration. Furthermore, we can investigate several polar cases: for Heston with zero
volatility-of-volatility, we should get BS prices; for Bates with zero volatility-of-volatility, we
should obtain prices as under Merton's jump diffusion (and, of course, Bates with zero
volatility-of-volatility and no jumps should again give BS prices).

When we calibrate models, we price not just one option, but a whole array of different
strikes and different maturities. But for a given set of parameters that describe the underlying process
of the model, the characteristic function $\phi$ only depends on the time to maturity, not on the
strike price. This makes sense because $\phi$ is a transform of the stock density. This density is not
influenced by the option's strike.

This suggests that speed improvements can be achieved by preprocessing those terms of $\phi$ that
are constant for a given maturity, and then computing the prices for all strikes for this maturity. See
Kilin (2011) for a discussion and Algorithm 65 for a summary.

Algorithm 65 Computing the prices for a given surface.
1: set parameters, set maturities {τ}, set strikes {X}
2: for τ ∈ {τ} do
3:     compute characteristic function φ
4:     for X ∈ {X} do
5:         compute price for strike X, maturity τ
6:     end for
7: end for
8: compute objective function


Speed


First, we compare the performance of direct integration with the performance of quad. Example
code for the Bates model follows; we start by fixing parameters.

Listing 17.24: C-OptionCalibration/M/./Ch15/examplePricing.m


% examplePricing.m -- version 2011-01-02
%% price matrix of options
S = 100; q = 0.02; r = 0.02;
XX = 70:5:130; % strikes
TT = [1/12 3/12 6/12 9/12 1 2 3]; % time to maturity in years
nS = length(XX); nT = length(TT);
% example parameters Bates
v0 = 0.2^2;    % current variance
vT = 0.2^2;    % long-run variance (theta in paper)
rho = -0.7;    % correlation
k = 1.0;       % mean reversion speed (kappa in paper)
sigma = 0.3;   % vol of vol
lambda = 0.1;  % intensity of jumps
muJ = -0.2;    % mean of jumps
vJ = 0.1^2;    % variance of jumps
runs = 100;    % for tic/toc


The first strategy is to loop over the strikes and maturities, and each time price the option with
callBatescf.

Listing 17.25: C-OptionCalibration/M/./Ch15/examplePricing.m

prices1 = NaN(nS,nT);
tic
for rr = 1:runs
    for kk = 1:nS
        for tt = 1:nT
            prices1(kk,tt) = callBatescf(S,XX(kk),TT(tt), ...
                r,q,v0,vT,rho,k,sigma,lambda,muJ,vJ);
        end
    end
end
t1 = toc;


Next we test the integration approach with fixed nodes and weights. We get a speedup of about 200. 


Listing 17.26: C-OptionCalibration/M/./Ch15/examplePricing.m

cfGeneric = @cfBatesGeneric;
param(1) = v0;
param(2) = vT;
param(3) = rho;
param(4) = k;
param(5) = sigma;
param(6) = lambda;
param(7) = muJ;
param(8) = vJ;
prices2 = NaN(nS,nT); % matrix of model prices
from = 0; to = 200; N = 50;
[x,w] = GLnodesweights(N);
[x,w] = changeInterval(x, w, -1, 1, from, to);
auxX = NaN(nS,N); %
tic
for rr = 1:runs
    ix = 1i * x;
    for tt = 1:nT
        tau = TT(tt);
        % evaluate CF at nodes
        CFi = S * exp((r-q) * tau);
        CF1 = cfGeneric(x - 1i,S,tau,r,q,param) ./ (ix * CFi);
        CF2 = cfGeneric(x     ,S,tau,r,q,param) ./ ix;
        for kk = 1:nS
            X = XX(kk);
            if tt == 1 % store for later maturities
                auxX(kk,:) = exp(-ix*log(X))';
            end
            P1 = 0.5 + w * real(auxX(kk,:)' .* CF1) / pi;
            P2 = 0.5 + w * real(auxX(kk,:)' .* CF2) / pi;
            prices2(kk,tt) = exp(-q * tau) * S * P1 - ...
                             exp(-r * tau) * X * P2;
        end
    end
end
t2 = toc; % compute speedup: t1/t2


Note that there is room for many small improvements: the ubiquitous $e^{-r\tau}$ terms can be precomputed;
more importantly, in the characteristic function many terms can be stored. For instance, the
terms d and g are independent of $\tau$, and could be computed once for the whole matrix.
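The structure of the fast approach can be tried out in isolation with the BS model, for which the result can be compared with the closed-form price. The sketch below is self-contained and rests on several assumptions made only for illustration: the BS characteristic function of the log price is coded inline as cf (the book's cfBSMGeneric is not reproduced here), the variance v = 0.2^2, the strikes and maturities, and N = 200 nodes are arbitrary choices, and the normal distribution function is computed via erfc. The strike-dependent terms are stored once in the matrix E, and the characteristic function is evaluated once per maturity.

```matlab
% Sketch: BS call prices for a surface via direct integration
S = 100; q = 0.02; r = 0.02; v = 0.2^2;   % v is an assumed variance
XX = 70:10:130; TT = [0.5 1 2];
nS = numel(XX); nT = numel(TT);

% Gauss-Legendre nodes/weights on [0,200]
N = 200;
eta = 1 ./ sqrt(4 - (1:(N-1)).^(-2));
[V,D] = eig(diag(eta,1) + diag(eta,-1));
[x,idx] = sort(diag(D));
w = 2 * V(1,idx).^2;
x = 100 * (x + 1); w = 100 * w;           % map [-1,1] to [0,200]

% BS characteristic function of log(S_tau) (assumption for this sketch)
cf = @(om,tau) exp(1i*om*(log(S) + (r - q - v/2)*tau) - 0.5*om.^2*v*tau);

ix = 1i * x;
E = exp(-ix * log(XX));                   % N x nS, computed once for all maturities
Ncdf = @(z) 0.5 * erfc(-z/sqrt(2));
prices = NaN(nS,nT); exact = NaN(nS,nT);
for tt = 1:nT
    tau = TT(tt);
    % evaluate CF at the nodes once per maturity
    CF1 = cf(x - 1i, tau) ./ (ix * S * exp((r-q)*tau));
    CF2 = cf(x, tau) ./ ix;
    P1 = 0.5 + (w * real(bsxfun(@times, E, CF1)))' / pi;   % all strikes at once
    P2 = 0.5 + (w * real(bsxfun(@times, E, CF2)))' / pi;
    prices(:,tt) = exp(-q*tau) * S * P1 - exp(-r*tau) * XX(:) .* P2;
    % closed-form benchmark
    d1 = (log(S ./ XX(:)) + (r - q + v/2)*tau) / sqrt(v*tau);
    d2 = d1 - sqrt(v*tau);
    exact(:,tt) = exp(-q*tau)*S*Ncdf(d1) - exp(-r*tau)*XX(:).*Ncdf(d2);
end
maxAbsErr = max(abs(prices(:) - exact(:)));
fprintf('largest absolute price error: %g\n', maxAbsErr);
```

Vectorizing over strikes is a design choice of this sketch; the listings in the text loop over strikes instead, which keeps the code closer to Algorithm 65.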


Accuracy 


We also need to check the numerical accuracy⁹ of our pricing methods. The BS model is an ideal
candidate to compare prices computed via integration with the analytical solution; see the MATLAB
script compareAccuracy.m. It presents three ways to compute the BS prices: with the classical
formula (variant 1), through integration with fixed nodes and weights (variant 2), and
through integration with quad (variant 3).

Listing 17.27: C-OptionCalibration/M/./Ch15/compareAccuracy.m

% compareAccuracy.m -- version 2010-11-05
%% price matrix of options
S = 100; q = 0.02; r = 0.02;
XX = 70:5:130; % strikes
TT = [1/12 3/12 6/12 9/12 1 2 3]; % time to maturity in years
nS = length(XX); nT = length(TT);
v = ...; % BS parameter (vol squared)
%% variant 1
prices1 = NaN(nS,nT);
for kk = 1:nS
    for tt = 1:nT
        prices1(kk,tt) = callBSM(S,XX(kk),TT(tt),r,q,v);
    end
end
%% variant 2
cfGeneric = @cfBSMGeneric;
param(1) = v;
from = 0; to = 200; N = 50;
[x,w] = GLnodesweights(N);
[x,w] = changeInterval(x, w, -1, 1, from, to);
prices2 = NaN(nS,nT); % matrix of model prices
auxX = NaN(nS,N); ix = 1i * x;
for tt = 1:nT
    tau = TT(tt);
    % evaluate CF at nodes
    CFi = S * exp((r-q) * tau);
    CF1 = cfGeneric(x - 1i,S,tau,r,q,param) ./ (ix * CFi);
    CF2 = cfGeneric(x     ,S,tau,r,q,param) ./ ix;
    for kk = 1:nS
        X = XX(kk);
        if tt == 1 % store for later maturities
            auxX(kk,:) = exp(-ix*log(X))';
        end
        P1 = 0.5 + w * real(auxX(kk,:)' .* CF1) / pi;
        P2 = 0.5 + w * real(auxX(kk,:)' .* CF2) / pi;
        prices2(kk,tt) = exp(-q * tau) * S * P1 - ...
                         exp(-r * tau) * X * P2;
    end
end

%% variant 3
prices3 = NaN(nS,nT);
for kk = 1:nS
    for tt = 1:nT
        prices3(kk,tt) = callBSMcf(S,XX(kk),TT(tt),r,q,v);
    end
end

%% compare
max(max(abs(100*(prices2-prices1))))
max(max(abs(100*(prices3-prices1))))
surf(prices2./prices1-1)

9. Accuracy here refers to the size of numerical errors compared with the trustworthy numerical benchmark.


Figs. 17.5 and 17.6 show the numerical differences. We see that it is important to look at both
absolute and relative errors (see also Section 2.2 on page 20). With 25 nodes, there is one substantial
relative pricing error of more than 100%. But the price of this option is 0.00. We can increase the
precision by using more nodes. With 50 nodes, we get the errors in Fig. 17.6.


The accuracy when integrating "by hand" is even higher than with quad. If (repeat: if) we
wanted the same numerical accuracy, we could change the function callBSMcf as follows:

% ...
vP1 = 0.5 + 1/pi * quad(@P1,0,200,1.0e-12,[],S,X,tau,r,q,vT);
vP2 = 0.5 + 1/pi * quad(@P2,0,200,1.0e-12,[],S,X,tau,r,q,vT);
% ...

That is, we enforce a lower tolerance than the default, which is $10^{-6}$.

FIGURE 17.5 Relative (left) and absolute (right; in cents) price errors for BS with direct integration with 25 nodes (compared with analytical solution).

FIGURE 17.6 Relative (left) and absolute (right; in cents) price errors for BS with direct integration with 50 nodes (compared with analytical solution). [Axes: strike and time to maturity in years; vertical axes: price error in % (left) and price error in cents (right).]
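The difference between the two error measures is easy to reproduce with a small calculation. The numbers below are hypothetical and chosen only to mimic the situation described above: one option with a sizeable price and one far out-of-the-money option whose price rounds to 0.00.

```matlab
% Sketch: absolute versus relative pricing errors (hypothetical numbers)
exact  = [12.3456 0.00002];    % benchmark prices
approx = [12.3457 0.00005];    % prices from numerical integration

absErrCents = 100 * abs(approx - exact);       % absolute error in cents
relErrPct   = 100 * abs(approx ./ exact - 1);  % relative error in percent

fprintf('absolute errors (cents): %g  %g\n', absErrCents);
fprintf('relative errors (%%):     %g  %g\n', relErrPct);
```

For the second option the relative error exceeds 100%, although the absolute error is a tiny fraction of a cent.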


17.3 Calibration 


As we pointed out at the start of the chapter, our calibration problem will be to find parameters such
that the model's prices are consistent with market prices. This can be written as an optimization
problem of the form

$$\min \sum_{i=1}^{M} \frac{\left| C_i^{\text{model}} - C_i^{\text{market}} \right|}{C_i^{\text{market}}} \tag{17.24}$$

where M is the number of market prices. Alternatively, we could specify absolute price differences,
use squares instead of absolute values, or introduce weighting schemes. The choice of the objective
function depends on the application; in the end, it is an empirical task to determine a good objective
function. Since here we are interested in numerical aspects, we will use specification (17.24).
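The alternative specifications mentioned above differ only in how the price differences are aggregated. A minimal sketch with hypothetical market and model prices (all numbers are made up for illustration) computes the relative absolute deviation of the type used in Eq. (17.24) next to two of the alternatives.

```matlab
% Sketch: three distance measures between model and market prices
Cmarket = [10.2 5.1 1.3];    % hypothetical market prices
Cmodel  = [10.0 5.3 1.2];    % hypothetical model prices

dev = Cmodel - Cmarket;
relAbs = sum(abs(dev) ./ Cmarket);   % relative absolute deviations
absAbs = sum(abs(dev));              % absolute deviations
sq     = sum(dev.^2);                % squared deviations

fprintf('relative: %.6f  absolute: %.2f  squared: %.2f\n', relAbs, absAbs, sq);
```

Dividing the sum by M, the number of prices, gives a mean instead of a sum; this rescales the objective function but does not change the minimizer.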

It turns out that this problem, like many others in finance, is not easy to solve, and gradient-based
methods will likely fail to do the job. Thus, we use heuristics. The results described in this section
are based on Gilli and Schumann (2011a).


17.3.1 Techniques 


We apply Differential Evolution (DE) and Particle Swarm Optimization (PSO). These methods have 
been described in Chapter 12, so we will not repeat the algorithms here. We will use a MATLAB 
implementation; the code is given below. 

Population-based methods like PSO and DE are often effective in exploration. They can quickly
identify promising areas of the search space; but then these methods converge only slowly. In the
literature we thus often find combinations of population-based search with local search (in the sense
of a trajectory method that evolves only a single solution). Examples of such combinations
are Memetic Algorithms (Moscato, 1989). So we also test a simple hybrid based on this idea;
it combines DE and PSO with a direct search component. In the classification systems of Talbi
(2002) or Winker and Gilli (2004), this is a high-level relay hybrid (see also the brief discussion in
Chapter 12, page 282).

Preliminary tests suggested that the objective function is often flat, that is, different parameter
values give similar objective function values. This indicates that (i) our problem may be sensitive
to small changes in the data when we are interested in precise parameter estimates; and that (ii) if
we insist on precisely computing parameters, we may need either many iterations, or an algorithm
with a large step size. Thus, as a local search strategy, we use the direct search method of Nelder
and Mead (1965) as implemented in MATLAB's fminsearch. This algorithm can change its step
size; it is also robust in case of noisy objective functions (e.g., functions evaluated by numerical
techniques that may introduce truncation error, as could be the case here). The hybrid is summarized
in Algorithm 66.

For an implementation, we need to decide how often we start the direct search, how many
solutions we select, and how we select them. With just one generation and $n_S$ equal to the population
size, we would have a simple restart strategy for the direct search method.

Algorithm 66 Hybrid search.
1: set parameters for population-based method
2: for k = 1 to n_G do
3:     do population-based search
4:     if local search then
5:         select n_S solutions as starting values for local search
6:         for each selected solution do
7:             perform direct search and update solution
8:         end for
9:     end if
10: end for
11: return best solution

Nelder-Mead direct search was described in Section 11.4.4; Fig. 17.7 recalls its operations. Let
us give a few more details on how the method operates.

FIGURE 17.7 Operations of Nelder-Mead. [Panels: a simplex (p = 2), reflection, expansion, outside contraction, inside contraction, shrinking.]

When we call the function fminsearch, MATLAB will transform our initial guess x into

$$\begin{bmatrix}
x^{(1)} & x^{(1)} + \varepsilon x^{(1)} & x^{(1)} & \cdots & x^{(1)} \\
x^{(2)} & x^{(2)} & x^{(2)} + \varepsilon x^{(2)} & \cdots & x^{(2)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x^{(p)} & x^{(p)} & x^{(p)} & \cdots & x^{(p)} + \varepsilon x^{(p)}
\end{bmatrix} \tag{17.25}$$

where the superscript $(i)$ denotes the ith element of x. In the implementation used here (Matlab
2008a), $\varepsilon$ is 0.05. If $x^{(i)}$ is zero, then $\varepsilon x^{(i)}$ is set to 0.00025.
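The construction of the initial simplex can be written out in a few lines. The sketch below follows the description above (a 5% perturbation of each element, and 0.00025 for elements that are zero); the initial guess x0 is hypothetical.

```matlab
% Sketch: initial simplex as in Eq. (17.25)
x0 = [0.04; 0; -0.7];              % hypothetical initial guess (p = 3)
p = numel(x0);
simplex = repmat(x0, 1, p + 1);    % column 1 is the initial guess itself
for i = 1:p
    if x0(i) ~= 0
        simplex(i, i+1) = x0(i) + 0.05 * x0(i);   % epsilon = 0.05
    else
        simplex(i, i+1) = 0.00025;                % element is zero
    end
end
disp(simplex)
```

Each of the p additional vertices differs from the initial guess in exactly one element, so the relative stretch of the simplex is the same along all dimensions with nonzero elements.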

The simplex adjusts to the contours of the objective function (i.e., it can stretch itself) and so can
make larger steps into favorable directions. But this flexibility can also be a disadvantage. Try to
visualize a narrow valley along which a long-stretched simplex advances. If this valley were to take
a turn, the simplex could not easily adapt (see Wright, 1996, for a discussion of this behavior). Some
testing showed that this phenomenon occurs in our problem. When we
initialize the simplex, the maximum of a parameter value is 5% greater than its minimum, and this
is true for all parameters by construction; see Eq. (17.25). Thus, the stretch in relative terms along
any dimension is the same. When we run a search and compare this initial with the final simplex,
we often find that the stretch along some dimensions is 200 times greater than along other dimensions;
the condition number of a simplex often increases by many orders of magnitude.
This is a warning sign, and indeed it turns out that restarting the algorithm, that is, re-initializing the
simplex several times, leads to much better solutions. (A remark: for practical applications, restarting
Nelder-Mead by re-initializing the simplex is often a cheap and thus helpful way to improve
solution quality. We are not aware of any implementation that does this automatically.)
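A restart strategy of this kind takes only a loop around fminsearch. The sketch below is an illustration, not the setup used for the calibration: the Rosenbrock function stands in for the objective function, and the iteration budget per run and the number of restarts are arbitrary. Every call to fminsearch builds a fresh simplex around the current solution.

```matlab
% Sketch: restarting Nelder-Mead by re-initializing the simplex
f = @(z) 100*(z(2) - z(1)^2)^2 + (1 - z(1))^2;   % stand-in objective
opts = optimset('Display','off','MaxIter',50,'MaxFunEvals',100);

z = [-1.2; 1];                 % starting value
nRestarts = 5;
fval = zeros(1, nRestarts + 1);
fval(1) = f(z);
for rs = 1:nRestarts
    z = fminsearch(f, z, opts);    % new simplex around the current solution
    fval(rs + 1) = f(z);
end
disp(fval)
```

Because the starting point is always a vertex of the new simplex and fminsearch returns its best vertex, the objective function value can never get worse from one restart to the next.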


Constraints 


We constrain all heuristics to favor interpretable values: we want nonnegative variances, correlation
between -1 and 1, and parameters like $\kappa$, $\sigma$, and $\lambda$ also nonnegative. There are various ways
in which these constraints could be implemented. A straightforward one is a penalty term: for any violation
a positive number proportional to the violation is added to the objective function. Here we use a
repair function, described below.

Note that we are actually cheating a bit: for the hybrid, we do not adapt Nelder-Mead to include
the constraints. This would not be necessary if we were using penalties; adding a penalty actually
changes the problem into an unconstrained problem and, hence, we need not adapt Nelder-Mead.
We could modify Nelder-Mead so that it takes care of constraints (e.g., by rejecting operations that
lead to infeasible candidate solutions), but we found that this is not necessary here. The problem
is, practically, only mildly constrained, since we only want to enforce that our parameters remain
meaningful; the constraints are not part of the financial model.
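The two ways of handling the bounds can be contrasted in a short sketch. Both functions below are hypothetical illustrations and not the repair function used later in the listings: repairClamp simply moves a violating element onto the nearest bound, and penalty returns the total size of the violations, which would be scaled and added to the objective function. The bounds are assumed to be stored as [min max] in a field Dc.

```matlab
% Sketch: repair by clamping versus a penalty term (both hypothetical)
Data.Dc = [0 1; -1 1];       % rows: parameters; columns: [min max]

repairClamp = @(param, Data) min(max(param, Data.Dc(:,1)), Data.Dc(:,2));
penalty = @(param, Data) sum(max(Data.Dc(:,1) - param, 0) + ...
                             max(param - Data.Dc(:,2), 0));

param = [-0.2; 1.5];         % violates the lower and the upper bound
repaired = repairClamp(param, Data);
pen = penalty(param, Data);

fprintf('repaired: %g %g   penalty: %g\n', repaired, pen);
```

A repaired solution is always feasible, so the objective function never sees an invalid parameter; with a penalty, the search may pass through infeasible regions but is pushed back toward the feasible set.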


17.3.2 Organizing the problem and implementation 


We implement both DE and PSO with MATLAB. The functions follow in a straightforward way 
from the algorithms given in Chapter 12 (see, in particular, the appendix of that chapter). Note 
that, unlike in R, we need not put too much emphasis on vectorization; loops are for the most part 
comparable in speed to vectorized computations with MATLAB. 

We start with a matrix of given prices, named prices0. This matrix has in its rows the different 
strikes, and in its columns the different times to maturity. The strikes are stored in a vector XX and 
time to maturity in years is a vector TT. We collect all such pieces of information in a structure 
Data. When we later call the objective function, we shall pass as arguments a particular solution 
and the structure Data. 


Data.model      a string (heston, bates, merton, or bsm)
Data.S          the stock price
Data.q          the dividend yield
Data.r          the interest rate
Data.prices0    the market prices (a matrix of size nS x nT)
Data.XX         a vector of strikes
Data.TT         a vector of times to maturity
Data.x          integration nodes
Data.w          integration weights
Data.Dc         [minP maxP], matrix with min and max for parameters
Data.d          length(minP)


The functions DE and PSO are later called with three arguments: an objective function OF, a 
structure Data, and a structure P. The objective function is a function called with two arguments: 
a particular solution (a vector), and the structure Data. The structure P holds all settings of the 
optimization procedure. 

The script calibOF.m gives an implementation of the objective function.


Listing 17.28: C-OptionCalibration/M/./Ch15/calibOF.m

function [res, me] = calibOF(param, Data)
% calibOF.m -- version 2011-01-16
model = Data.model;
S = Data.S; q = Data.q; r = Data.r;
XX = Data.XX; TT = Data.TT;
prices0 = Data.prices0;
x = Data.x; w = Data.w;

if strcmp(model,'heston')
    cfGeneric = @cfHestonGeneric;
elseif strcmp(model,'bates')
    cfGeneric = @cfBatesGeneric;
elseif strcmp(model,'bsm')
    cfGeneric = @cfBSMGeneric;
elseif strcmp(model,'merton')
    cfGeneric = @cfMertonGeneric;
else
    error('model not specified')
end

% initialize structures
nS = length(XX); % number of strikes
nT = length(TT); % number of expiries
N = length(x); % number of nodes
Prices = NaN(nS,nT); % matrix of model prices
auxX = NaN(nS,N); %

% evaluate surface
ix = 1i * x;
for tt = 1:nT % loop over times to maturity
    tau = TT(tt);
    % evaluate CF at nodes
    CFi = S * exp((r-q) * tau);
    CF1 = cfGeneric(x - 1i,S,tau,r,q,param) ./ (ix * CFi);
    CF2 = cfGeneric(x     ,S,tau,r,q,param) ./ ix;
    for kk = 1:nS % loop over strikes
        X = XX(kk);
        if tt == 1 % store for later maturities (also if price is missing)
            auxX(kk,:) = exp(-ix*log(X))';
        end
        if ~isnan(prices0(kk,tt))
            P1 = 0.5 + w * real(auxX(kk,:)' .* CF1) / pi;
            P2 = 0.5 + w * real(auxX(kk,:)' .* CF2) / pi;
            Price = exp(-q * tau) * S * P1 - exp(-r*tau) * X * P2;
            Prices(kk,tt) = max(Price,0);
        end
    end
end
% replace missing values by zero
prices0(isnan(prices0)) = 0; Prices(isnan(Prices)) = 0;
% compute distance between Prices and prices0
aux = abs(Prices(:) - prices0(:));
valid = prices0(:) > 0; % skip missing prices (avoids 0/0)
res = mean(aux(valid) ./ prices0(valid));
me = max(aux);


As pointed out before, the functions could be accelerated by precomputing more quantities. In
calibOF, the term ix = 1i * x is fixed for all parameter values. We could (and should) thus
compute it outside the objective function. But here we have tried to reduce such precomputations
to make the code clearer.

The following script shows the function DE. 


Listing 17.29: C-OptionCalibration/M/./Ch15/DE.m

function [xbest,Fbest,Fbv] = DE(OF,Data,P)
% DE.m -- version 2011-01-07
% settings for direct search
if P.NM > 0
    optionsA = optimset('Display','off','MaxIter',P.NMiter,...
        'MaxFunEvals',P.NMiter);
    z = @(param) calibOF(param,Data);
end

% initialize matrices
Fbv = NaN(P.nG,1); % best F-value over generations
Fbest = realmax; % best solution value
F = zeros(P.nP,1); % vector of F-values of members of population

% construct starting population
hi_lo = Data.Dc(:,2)-Data.Dc(:,1);
P1 = diag(hi_lo) * rand(Data.d,P.nP);
for i = 1:Data.d, P1(i,:) = P1(i,:) + Data.Dc(i,1); end

% ... and evaluate it
for i = 1:P.nP
    F(i) = feval(OF,P1(:,i),Data);
    if isnan(F(i)), F(i) = 1e7; end % just in case...
    if F(i) < Fbest
        Fbest = F(i);
        xbest = P1(:,i);
    end
end

% start generations
for k = 1:P.nG
    P0 = P1;
    Io = randperm(P.nP)'; Ic = randperm(4)';
    R1 = circshift(Io,Ic(1));
    R2 = circshift(Io,Ic(2));
    R3 = circshift(Io,Ic(3));
    Pv = P0(:,R1) + P.F * (P0(:,R2) - P0(:,R3)); % new solutions
    mPv = rand(Data.d,P.nP) < P.CR; % crossover
    Pu = P0; Pu(mPv) = Pv(mPv);
    for i = 1:P.nP
        Pu(:,i) = repair(Pu(:,i),Data);
        Ftemp = feval(OF,Pu(:,i),Data);
        if Ftemp <= F(i)
            P1(:,i) = Pu(:,i);
            F(i) = Ftemp;
        end
    end
    % direct search
    if P.NM > 0
        if mod(k,P.NMmod) == 0
            if P.NMchoice == 1
                [ign,nnn] = sort(F);
            elseif P.NMchoice == 2
                nnn = randperm(P.nP);
            else
                error('choice not allowed')
            end
            for ni = 1:P.NMn
                auxF = F(nnn(ni)); diff = 1e7;
                while diff > P.NMpres
                    paramS = P1(:,nnn(ni));
                    NMsol = fminsearch(z,paramS,optionsA);
                    NMsol = repair(NMsol,Data);
                    Ftemp = calibOF(NMsol,Data);
                    if Ftemp < F(nnn(ni))
                        P1(:,nnn(ni)) = NMsol;
                        F(nnn(ni)) = Ftemp;
                        diff = abs(Ftemp-auxF);
                        auxF = Ftemp;
                    else
                        break
                    end
                end
            end
        end
    end
    % find best
    [Fbest,ibest] = min(F);
    Fbv(k) = Fbest;
    xbest = P1(:,ibest);
end
fprintf('standard dev. of solutions %4.3f\n',std(F))
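The index vectors R1, R2, and R3 in DE.m deserve a short illustration. Shifting one random permutation by three different amounts yields, for every member of the population, three mutually different solutions for the mutation step, without any loop. The sketch below uses a hypothetical population size of 8.

```matlab
% Sketch: three mutually distinct index vectors via circshift
nP = 8;                            % hypothetical population size
Io = randperm(nP)';                % random permutation of 1,...,nP
Ic = randperm(4)';                 % the first three entries are distinct shifts
R1 = circshift(Io, Ic(1));
R2 = circshift(Io, Ic(2));
R3 = circshift(Io, Ic(3));

distinct = all(R1 ~= R2 & R1 ~= R3 & R2 ~= R3);
disp([R1 R2 R3])
fprintf('all rows mutually distinct: %d\n', distinct);
```

Since the shifts differ by at most 3, the rows are guaranteed to be distinct as long as the population has at least four members. Note that the construction does not exclude the case that one of the three indices equals the index of the solution being updated.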




Similarly, we coded PSO in a function PSO. 


Listing 17.30: C-OptionCalibration/M/./Ch15/PSO.m

function [xbest,Fbest,Fbv] = PSO(OF,Data,PS)
% PSO.m -- version 2011-01-07
if PS.NM>0
    optionsA = optimset('Display','off','MaxIter',PS.NMiter,...
        'MaxFunEvals',PS.NMiter);
    z = @(param) calibOF(param,Data);
end

d = PS.d; iner = PS.iner;
Fbv = NaN(PS.nG,1); % best F-value over generations
F = zeros(PS.nP,1); % vector of F-values members of population

v = PS.cv * randn(d,PS.nP); % initialize velocity

% construct starting population
hi_lo = Data.Dc(:,2)-Data.Dc(:,1);
P = diag(hi_lo) * rand(PS.d,PS.nP);
for i = 1:Data.d, P(i,:) = P(i,:) + Data.Dc(i,1); end

% ... and evaluate population
for i = 1:PS.nP
    F(i) = feval(OF,P(:,i),Data);
end
Pbest = P; Fbest = F; [Gbest,gbest] = min(F);

for k = 1:PS.nG
    v = iner*v + PS.c1 * rand(d,PS.nP) .* (Pbest - P) + ...
        PS.c2*rand(d,PS.nP) .* (Pbest(:,gbest)*ones(1,PS.nP)-P);
    v = min(v, PS.vmax); v = max(v,-PS.vmax);
    P = P + v;
    for i = 1:PS.nP
        P(:,i) = repair(P(:,i),Data);
        F(i) = feval(OF,P(:,i),Data);
        if isnan(F(i)), F(i)=1e7; end
    end
    I = F < Fbest;
    Fbest(I) = F(I); Pbest(:,I) = P(:,I);
    [Gbest,gbest] = min(Fbest); Fbv(k) = Gbest;
    % direct search
    if PS.NM>0
        if mod(k,PS.NMmod) == 0
            if PS.NMchoice==1
                [ign,nnn] = sort(F);
            elseif PS.NMchoice==2
                nnn = randperm(PS.nP);
            else
                error('choice not allowed')
            end
            for ni=1:PS.NMn
                auxF = F(nnn(ni)); diff = 1e7;
                while diff > PS.NMpres
                    paramS = P(:,nnn(ni));
                    NMsol = fminsearch(z,paramS,optionsA);
                    NMsol = repair(NMsol,Data);
                    Ftemp = calibOF(NMsol,Data);
                    if Ftemp < F(nnn(ni))
                        P(:,nnn(ni)) = NMsol;
                        if Ftemp < Fbest(nnn(ni))
                            Pbest(:,nnn(ni)) = NMsol;
                            Fbest(nnn(ni)) = Ftemp;
                        end
                        F(nnn(ni)) = Ftemp;
                        diff = abs(Ftemp-auxF);
                        auxF = Ftemp;
                    else
                        break
                    end

                end
            end
        end
        I = F < Fbest;
        Fbest(I) = F(I);
        Pbest(:,I) = P(:,I);
        [Gbest,gbest] = min(Fbest);
        Gbest = Gbest(1); gbest = gbest(1);
        Fbv(k) = Gbest;
    end
end
xbest = Pbest(:,gbest);
% report dispersion of the personal bests before Fbest is overwritten
fprintf('Standard dev. of solutions %4.3f\n',std(Fbest))
Fbest = Gbest;
end

A complete example follows. We start by choosing a model and setting true parameters for it.
Then we compute prices0 with the functions that use quad.


Listing 17.31: C-OptionCalibration/M/./Ch15/calibrate.m 


% calibrate.m -- version 2010-11-29
%% choose a model
model = 'heston';
model = 'bates';
model = 'bsm';
%model = 'merton';
%% set true parameters and compute true prices
S = 100; q = 0.02; r = 0.02;
XX = 75:5:125; TT = [1/12 3/12 6/12 9/12 1 2 3];
nS = length(XX); nT = length(TT); prices0 = NaN(nS,nT);
if strcmp(model,'heston')
    v0    = <illegible>;  % current variance
    vT    = 0.2^2;        % long-run variance (theta in paper)
    rho   = -0.7;         % correlation
    k     = <illegible>;  % mean reversion speed (kappa in paper)
    sigma = 0.5;          % vol of vol
    for kk = 1:length(XX)
        for tt = 1:length(TT)
            prices0(kk,tt) = ...
                callHestoncf(S,XX(kk),TT(tt),r,q,v0,vT,rho,k,sigma);
        end
    end
    minP = [0.05^2 0.05^2 -1 0.01 0.01]';
    maxP = [0.90^2 0.90^2  1 5.00 5.00]';
elseif strcmp(model,'bates')
    v0     = <illegible>; % current variance
    vT     = 0.2^2;       % long-run variance (theta in paper)
    rho    = -0.3;        % correlation
    k      = 1.0;         % mean reversion speed (kappa in paper)
    sigma  = 0.3;         % vol of vol
    lambda = 0.2;         % intensity of jumps
    muJ    = -0.1;        % mean of jumps
    vJ     = 0.1^2;       % variance of jumps
    for kk = 1:length(XX)
        for tt = 1:length(TT)
            prices0(kk,tt) = callBatescf(S,XX(kk),TT(tt), ...
                r,q,v0,vT,rho,k,sigma,lambda,muJ,vJ);
        end
    end
    minP = [0.05^2 0.05^2 -1 0.01 0.01 0.00 -0.25 0.00^2]';
    maxP = [1.00^2 1.00^2  1 5.00 5.00 0.50  0.25 0.90^2]';
elseif strcmp(model,'bsm')
    v = 0.2^2;            % variance
    for kk = 1:length(XX)
        for tt = 1:length(TT)
            prices0(kk,tt) = callBSMcf(S,XX(kk),TT(tt),r,q,v);
        end
    end
    minP = [0.01^2]';
    maxP = [2.00^2]';
elseif strcmp(model,'merton')
    v      = <illegible>; % variance (volatility squared)
    lambda = 0.2;         % intensity of jumps
    muJ    = -0.03;       % mean of jumps
    vJ     = 0.03^2;      % variance of jumps
    for kk = 1:length(XX)
        for tt = 1:length(TT)
            prices0(kk,tt) = callMertoncf(S,XX(kk),TT(tt), ...
                r,q,v,lambda,muJ,vJ);
        end
    end
    minP = [0.05^2 0.00 -0.25 0.05^2]';
    maxP = [1.00^2 0.50  0.25 0.90^2]';
else
    error('model not specified')
end


Then we set up the structure Data that holds all the variables necessary to compute the objective 
function. 


Listing 17.32: C-OptionCalibration/M/./Ch15/calibrate.m 


% add noise
%prices0 = prices0 .* (randn(nS,nT)*0.001+1);

%% nodes/weights for integration
from = 0; to = 200; N = 50;
[x,w] = GLnodesweights(N);
[x,w] = changeInterval(x, w, -1, 1, from, to);

% collect all in Data structure
Data.model = model;
Data.s = S;
Data.q = q;
Data.r = r;
Data.prices0 = prices0;
Data.XX = XX;
Data.TT = TT;
Data.x = x;
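
The call to changeInterval above moves the Gauss–Legendre nodes and weights from [-1, 1] to [from, to]. The book's own changeInterval is not reproduced in this section, so the following is only a minimal sketch of the usual affine change of variable; the helper names mapX and mapW are hypothetical, and a two-point rule is hard-coded so that the example is self-contained.

```matlab
% Hypothetical stand-ins for changeInterval: affine map of nodes and
% weights from [oldA,oldB] to [newA,newB] (an assumption, not the book's code).
scale = @(oldA,oldB,newA,newB) (newB - newA) / (oldB - oldA);
mapX  = @(x,oldA,oldB,newA,newB) newA + (x - oldA) * scale(oldA,oldB,newA,newB);
mapW  = @(w,oldA,oldB,newA,newB) w * scale(oldA,oldB,newA,newB);

% two-point Gauss-Legendre rule on [-1,1]; it is exact for cubics
x = [-1 1]' / sqrt(3);     % nodes as a column vector
w = [1 1];                 % weights as a row vector

from = 0; to = 3;
xNew = mapX(x, -1, 1, from, to);
wNew = mapW(w, -1, 1, from, to);

% the integral of t^2 over [0,3] is 9
approx = wNew * (xNew.^2);
fprintf('approximation: %10.8f (exact: 9)\n', approx)
```

Keeping the weights as a row vector and the nodes as a column vector means that the integral is a single inner product, which is the same convention the listings use (w*GaussF(x)).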


Finally, we decide on the settings for DE (the matrix P1) and the settings for PSO (the matrix P2); 
then we can run the algorithms. 


Listing 17.33: C-OptionCalibration/M/./Ch15/calibrate.m 


Data.w = w;
Data.Dc = [minP maxP];
Data.d = length(minP);
% DE settings (P1)
P1.nP = 25; P1.nG = 50;
P1.CR = 0.95; P1.F = 0.5;
P1.d = length(minP);
P1.NM = 1;         % do direct search (DS)? 1 = yes, 0 = no
P1.NMpres = 0.001; % when to stop DS
P1.NMiter = 100;   % maximum iterations for DS
P1.NMmod = 10;     % how often to do DS (mod 10: every 10 it.)
P1.NMn = 2;        % how many searchers
P1.NMchoice = 1;   % what searchers: 1 = elite, 2 = random
% PS settings (P2)
P2.nP = 25; P2.nG = 50;
P2.d = length(minP);
P2.cv = 0.1;       % initial velocity
P2.vmax = 0.5;     % maximum (absolute) velocity
P2.iner = 0.7;     % inertia weight
P2.c1 = 1;         % weight personal best
P2.c2 = 2;         % weight alltime best
P2.NM = 1;         % do direct search (DS)? 1 = yes, 0 = no
P2.NMpres = 0.001; % when to stop DS
P2.NMiter = 100;   % maximum iterations for DS
P2.NMmod = 10;     % how often to do DS (mod 10: every 10 it.)
P2.NMn = 2;        % how many searchers
P2.NMchoice = 1;   % what searchers: 1 = elite, 2 = random

% run DE
fprintf('\nDifferential Evolution \n')
tic, [solA,FbestA,FbvA] = DE(@calibOF,Data,P1); toc
[meanE,maxE] = calibOF(solA,Data)
% run PS
fprintf('\nParticle Swarm \n')


Repairing solutions 


The matrix Dc, which is passed with Data, holds the minimum and maximum levels for the parameters. These values were used only for generating the initial solutions. We can also use the
information as actual bounds of the parameters. If a parameter lies outside this range, we reflect it
back into its boundaries. This can be done with simple arithmetic operators; see Algorithm 67.

We have used the fact that the maximum operator max(a, b) and minimum operator min(a, b)
can be replaced with

$$\max(a,b) = \frac{a+b}{2} + \frac{|a-b|}{2}, \qquad \min(a,b) = \frac{a+b}{2} - \frac{|a-b|}{2},$$

which works also on vectors. This mechanism is coded as a function repair that is called after
new solutions are created, but before they are evaluated in DE and PSO.

Algorithm 67 Repairing a solution x by reflection.

1: set upper bound x^hi and lower bound x^lo (a is a temporary variable)
2: compute a = x - x^hi                          # repair upper bound
3: compute a = a + |a|
4: compute x = x - a
5: compute a = x^lo - x                          # repair lower bound
6: compute a = a + |a|
7: compute x = x + a
8: compute x = ((x + x^lo) + |x - x^lo|) / 2     # final check: is new x above lower bound?
9: compute x = ((x + x^hi) - |x - x^hi|) / 2     # final check: is new x below upper bound?
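
The steps of Algorithm 67 translate almost line by line into vectorized MATLAB. The sketch below is not the book's repair function (which takes the bounds from Data.Dc); the bounds and the candidate solution are made-up values chosen so that one element violates the upper bound and one the lower bound.

```matlab
% Reflection repair following the steps of Algorithm 67.
% lo, hi and x are illustrative values (assumptions), not from the book.
lo = [0; -1; 0.01];        % lower bounds
hi = [1;  1; 5.00];        % upper bounds
x  = [1.2; -1.3; 2.0];     % candidate: elements 1 and 2 violate the bounds

a = x - hi;  a = a + abs(a);  x = x - a;   % reflect at the upper bound
a = lo - x;  a = a + abs(a);  x = x + a;   % reflect at the lower bound
x = ((x + lo) + abs(x - lo)) / 2;          % final check: max(x, lo)
x = ((x + hi) - abs(x - hi)) / 2;          % final check: min(x, hi)

disp(x')                                   % repaired solution
```

The expression a + |a| equals twice the positive part of a, so elements inside the bounds are left untouched, while a violation of size e is moved to a distance e inside the bound. The two final lines matter only when the reflection itself overshoots the opposite bound.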


17.3.3. Two experiments 


We briefly describe experiments on the Heston model and the Bates model; more details can be
found in Gilli and Schumann (2011a).¹⁰

We start by creating artificial data sets: the spot price S0 is 100, the risk-free rate r is 2%, there
are no dividends. We compute prices for strikes X from 80 to 120 in steps of size 2, and maturities τ
of 1/12, 3/12, 6/12, 9/12, 1, 2 and 3 years. Hence our surface comprises 21 × 7 = 147 prices. Given a
set of parameters, we compute option prices and store them as the true prices. Then we run each of
our methods 10 times to solve Problem 17.24 and see if we can recover the parameters; the setup
implies that a perfect fit is possible. The parameters for the Heston model come from the following
table (each column is one parameter set):


√v0    0.3   0.3   0.3   0.3   0.4   0.2   0.5   0.6   0.7   0.8
√θ     0.3   0.3   0.2   0.2   0.2   0.4   0.5   0.3   0.3   0.3
ρ     -0.3  -0.7  -0.9   0.0  -0.5  -0.5   0.0  -0.5  -0.5  -0.5
κ      2.0   0.2   3.0   3.0   0.2   0.2   0.5   3.0   2.0   1.0
σ      1.5   1.0   0.5   0.5   0.8   0.8   3.0   1.0   1.0   1.0


For the Bates model, we use the following parameter sets: 


√v0    0.3   0.3   0.3   0.3   0.4   0.2   0.5   0.6   0.7   0.8
√θ     0.3   0.3   0.2   0.2   0.2   0.4   0.5   0.3   0.3   0.3
ρ     -0.3  -0.7  -0.9   0.0  -0.5  -0.5   0.0  -0.5  -0.5  -0.5
κ      2.0   0.2   3.0   3.0   0.2   0.2   0.5   3.0   2.0   1.0
σ      0.3   0.5   0.5   0.5   0.8   0.8   1.0   1.0   1.0   1.0
λ      0.1   0.1   0.2   0.2   0.2   0.2   0.2   0.2   0.2   0.2
μJ     0.1  -0.1  -0.1  -0.1  -0.1  -0.1  -0.1  -0.1  -0.1  -0.1
σJ     0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1


With ten different parameter sets for each model and with ten restarts for each parameter set, we 
have 100 results for each optimization method. For each restart, we store the value of the objective 
function (the mean percentage error; Eq. (17.24)), and the corresponding parameter estimates. For 
the latter, we compute absolute errors, that is, 


error = | estimated parameter — true parameter |. 


Below we look at the distributions of these errors. 
All algorithms are coded in MATLAB; for the direct search we use MATLAB's fminsearch.
We ran a number of preliminary experiments to find effective parameter values for the algorithms. 


10. The implementation was slightly different there, but that does not change the results. 
For DE, the F-parameter should be set to around 0.3–0.5 (we use 0.5); very low or high values
typically impaired performance. The CR-parameter had less influence, but levels close to unity
worked best; each new candidate solution is then likely changed in many dimensions. For PSO,
the main task is to accelerate convergence. Velocity should not be allowed to become too high;
therefore, inertia should be below unity (we set it to 0.7); we also restricted maximum absolute
velocity to 0.2. The stopping criterion for DE and PSO is a fixed number of function evaluations
(population size × generations); we run three settings,


1250 (25 × 50),
5000 (50 × 100),
20,000 (100 × 200).


An alternative stopping criterion is to halt the algorithm once the diversity within the population
(as measured, for instance, by the range of objective function or parameter values) falls below a
tolerance level. This strategy works fine for DE, where the solutions generally converge rapidly, but
leads to longer run times for PSO.
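
A diversity-based stopping rule of this kind can be sketched in a few lines. The example is purely illustrative: the objective is a toy function, the population update is a deliberately simple contraction towards the current best solution (an assumption, neither DE nor PSO), and the tolerance is arbitrary. Only the stopping test itself, the range of objective function values falling below a tolerance, corresponds to the criterion described above.

```matlab
% Stop when the range of objective function values in the population
% falls below a tolerance. Objective and update rule are toy assumptions.
OF  = @(P) sum(P.^2, 1);                       % one value per column (solution)
P   = [linspace(-2,2,10); linspace(1,3,10)];   % deterministic start, 10 solutions
nG  = 200;                                     % upper limit on generations
tol = 1e-6;                                    % tolerance on the range of F
for k = 1:nG
    F = OF(P);
    if max(F) - min(F) < tol                   % diversity has collapsed: stop
        break
    end
    [Fmin, ibest] = min(F);
    % toy update: move every solution halfway towards the current best
    P = P + 0.5 * (P(:,ibest) * ones(1,size(P,2)) - P);
end
fprintf('stopped in generation %d, range of F = %g\n', k, max(F) - min(F))
```

In a real run the upper limit nG still acts as a safeguard, since a population may keep its diversity for a long time.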

For the hybrid methods, we use a population of 25 solutions, and run 50 generations. Every 10
generations, one or three solutions are selected, either the best solutions (“elitists”) or random solutions. These solutions are then used as the starting values of a direct search. This search comprises
repeated applications of Nelder–Mead, restricted to 200 iterations each, until no further improvement can be achieved; “further improvement” is a decrease in the objective function greater than
0.1%. Note that all the settings can be passed through the structures Data and P (see the MATLAB
script calibrate.m).
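
The repeated direct search can be illustrated on its own, outside DE or PSO. In the sketch below, the Rosenbrock function stands in for the calibration objective (an assumption; the listings use calibOF), the starting value plays the role of a selected population member, and the threshold 0.001 is assumed to correspond to the setting NMpres in calibrate.m. The calls to fminsearch and optimset are standard MATLAB.

```matlab
% Repeated Nelder-Mead runs until the decrease in the objective function
% is no longer greater than a threshold. Objective and start are assumptions.
OF      = @(x) (1 - x(1))^2 + 100 * (x(2) - x(1)^2)^2;   % Rosenbrock function
options = optimset('MaxIter', 200, 'Display', 'off');    % 200 iterations per run
tol     = 0.001;               % required decrease, cf. NMpres
x       = [-1.2; 1];           % starting value, e.g. an elite solution
Fx      = OF(x);
nRuns   = 0;
improvement = Inf;
while improvement > tol
    [xNM, FNM] = fminsearch(OF, x, options);
    nRuns = nRuns + 1;
    if FNM < Fx                % accept only if the search improved
        improvement = Fx - FNM;
        x = xNM; Fx = FNM;
    else
        break
    end
end
fprintf('%d Nelder-Mead runs, final objective value %g\n', nRuns, Fx)
```

Each accepted run lowers the objective, and the loop continues only after a decrease larger than tol, so for an objective that is bounded below the loop must terminate.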


Results: Heston model 
Price errors 


Below we plot the distributions of price errors in percentage points. Increasingly darker gray stands 
for more function evaluations. On the left, we have DE; in the middle, we have PSO; on the right, 
the hybrid based on DE. 
Both DE and PSO, with increasing computational resources, eventually give solutions with a zero
error. The performance of the PSO-based hybrid was very similar to that based on DE, even though
there was a slight advantage for DE. With DE the population converged faster, and hence it mattered
little for the hybrid whether the solutions for the direct search were chosen by quality or randomly.
This can be seen in the right panel of the figure above. There are actually four curves: the
dashed lines are from random searchers, the solid lines from elite searchers. Do not try to identify
a specific distribution; they are essentially the same. We see that there are a few rare outliers (the
long tails to the upside), but generally the hybrid performed very well. When the hybrid was based
on PSO, a direct search that started from the best population members performed better than one
started from randomly chosen members.


Parameter errors 


So we can get a good fit of the model (i.e., a low average price error). But what about the parameters?

The figure below shows the distributions of absolute errors in the parameters for √v0 (left), √θ
(middle), and ρ (right).

Below, we have the distributions of absolute errors in the parameters for κ (left) and σ (right).

We see that the parameter estimates roughly mirror the convergence behavior of the objective function values. With increasing computational resources (indicated by darker gray), the distributions
converge on zero. In some cases, though, there remain large errors; for instance, when we look at
the long-run variance.

To investigate this behavior, we pool all the solutions and plot the errors in the objective function
(i.e., the fit in terms of pricing error) against the parameter errors. The following figure shows the
results. In the upper panel: √v0 (left), √θ (middle), and ρ (right). In the lower panel: κ (left) and
σ (right). All plots have the price error on the x-axis, the parameter error on the y-axis.

For illustration, in the middle upper panel (long-run volatility) we have singled out one point. The
price error is less than 0.5% (i.e., perfectly acceptable), but the error in the volatility is 25%, which
is a large number when interpreted financially.


Results: Bates model 
Price errors 


Again, we start with a plot of the price errors in percentage points. Again, on the left is DE; in the 
middle is PSO; on the right is the hybrid based on DE. We see convergence, but slower this time. 
This should be expected to some extent since the model is more complicated (has more parameters). 


[Figure: distributions of price errors (x-axes: errors in %) for DE (left), PSO (middle), and the DE-based hybrid (right).]




Parameter errors 


Again, we also look at the errors in the parameters. The next figure shows the distributions of
absolute errors in the parameters for √v0 (left), √θ (middle), and ρ (right).

Next, we have the distributions of absolute errors in the parameters for κ (left) and σ (right).

Finally, the figure below shows the distributions of absolute errors in the jump parameters: λ
(left), μJ (middle), and σJ (right).

Remarkably, there seems to be no convergence for these parameters, even as we increase the number
of function evaluations (i.e., let the algorithms search longer).


We again plot the parameter errors against the average pricing errors (i.e., the realized objective
function values, on the x-axes). Below: √v0 (left), √θ (middle), and ρ (right).

Below: λ (left), μJ (middle), and σJ (right).

So for the Bates model, those parameters that are also in the Heston model are estimated less
precisely; but for the jump parameters (λ, μJ, and σJ) there is essentially no convergence. No
convergence in the parameters does not mean we cannot get a good fit; see the figures above. This
non-convergence is to some extent due to our choice of parameters. Experiments with Merton’s
model (not reported) showed that for “small” mean jumps μJ of magnitude −10% or −20%, it
is difficult to recover parameters precisely because many parameter values give price errors of
practically zero. In other words, the numerical optimization is fine, we can well fit the prices,
but we cannot accurately identify the different parameters. In any case, parameter values of the
magnitude used here have been reported in the literature (e.g., in Schoutens et al., 2004; or in
Detlefsen and Härdle, 2007). The numerical precision improves for large jumps. This is consistent
with existing studies. He et al. (2006) for instance report relatively precise estimates for a mean 
jump size of —90%, yet at some point, this begs the question how much reason we should impose 
on the parameters. An advantage of theoretical models over simple interpolatory schemes is the 
interpretability of parameters. If we can only fit option prices by unrealistic parameters, or cannot 
identify the parameters with any meaningful accuracy, there is little advantage in using such models. 


17.4 Final remarks 


In this chapter we have investigated the calibration of option pricing models. For more complex 
models, we have shown how to calibrate the parameters of a model with heuristic techniques, and 
that we can improve the performance of these methods by adding a Nelder–Mead direct search.
While good price fits could be achieved with all methods, the convergence of parameter estimates 
was much slower; for the jump parameters of the Bates model, there was no convergence. This, it 
must be stressed, is not a problem of the optimization technique, but it stems from the model. In 
comparison, parameters of the Heston model could be estimated more easily. 

In empirical studies on option pricing models (e.g., Bakshi et al., 1997; or Schoutens et al., 
2004), the calibration is often taken for granted; it is rarely discussed whether, for instance, restarts 
of an optimization routine with different starting values would have resulted in different parameter 
estimates, and how such different estimates would have influenced the studies’ results. (In Gilli 
and Schumann (2010a) we showed that standard gradient-based methods often fail for the kinds of 
calibration problems discussed in this chapter, and that restarts with different starting values can 
lead to very different solutions.) Different parameter values may lead to good overall fits in terms 
of prices, but these different parameters may well imply different Greeks, or have a more marked 
influence on prices of exotic options. Hence, empirical studies that look, for example, into hedging 
performance should take into account the sensitivity of their results with respect to calibration. With 
luck, all that is added is another layer of noise; but the relevance of optimization is to be investigated 
by empirical testing, not by conjecturing. Such testing is straightforward: we just need to rerun 
our empirical tests many times, each time also rerunning our calibration with alternative starting 
values and, hence, get an idea of the sensitivity of outcomes with respect to optimization quality. 
Ideally, optimization quality in-sample should be evaluated jointly with empirical, out-of-sample 
performance of the model; see Gilli and Schumann (2011b).
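
The kind of sensitivity check suggested here, rerunning an optimization from alternative starting values and looking at the spread of the results, can be sketched with a toy problem. The objective below is a made-up one-dimensional function with two local minima (an assumption for illustration only); in an actual study it would be the calibration objective, and the outcome of interest would be whatever the empirical test computes from the calibrated parameters.

```matlab
% Restart a local search from several starting values and compare results.
% The objective is a toy function with two local minima (an assumption).
OF     = @(x) x.^4 - 3*x.^2 + x;
starts = linspace(-2, 2, 9);               % alternative starting values
opts   = optimset('Display', 'off');
sol    = NaN(size(starts));
Fval   = NaN(size(starts));
for s = 1:numel(starts)
    [sol(s), Fval(s)] = fminsearch(OF, starts(s), opts);
end
fprintf('solutions found:  min %6.3f, max %6.3f\n', min(sol), max(sol))
fprintf('objective values: min %6.3f, max %6.3f\n', min(Fval), max(Fval))
```

If all restarts agree, optimization quality is unlikely to drive the results; a wide spread in sol or Fval, as in this toy case, signals that conclusions may depend on the starting value.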

Our findings underline the point raised in Gilli and Schumann (2010d) that modelers in quantitative finance should be skeptical of purely-numerical precision. Model risk is a still underappreciated
aspect in quantitative finance (and one that had better not be handled by rigorous mathematical 
modeling). For instance, Schoutens et al. (2004) showed that the choice of an option pricing model 
can have a large impact on the prices of exotic options, even though all models were calibrated 
to the same market data. (Unfortunately, different calibration criteria lead to different results; see 
Detlefsen and Härdle, 2007). In the same vein, Jessen and Poulsen (2009) find that different models,
when calibrated to plain vanilla options, exhibit widely differing pricing performance when used to 
explain actual prices of barrier options. Our results suggest that the lowly numerical optimization 
itself can make a difference. How important this difference is needs to be assessed empirically. 


Appendix 17.A Quadrature rules for infinity 


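There are Gauss rules if the integrals are defined from or to infinity. We may have

$$\int_0^\infty f(x)\,dx \quad\text{or}\quad \int_{-\infty}^{\infty} f(x)\,dx.$$

The relevant Gauss rules here are called Gauss–Laguerre and Gauss–Hermite. For Gauss–Laguerre, we use the following trick:

$$\int_0^\infty f(x)\,dx = \int_0^\infty e^{-x} e^{x} f(x)\,dx \approx \sum_i e^{x_i} w_i f(x_i);$$

thus, we weight the function with another function, $e^{-x}$, that rapidly decays to zero.
For Gauss–Hermite, we use

$$\int_{-\infty}^{\infty} f(x)\,dx = \int_{-\infty}^{\infty} e^{-x^2} e^{x^2} f(x)\,dx \approx \sum_i e^{x_i^2} w_i f(x_i).$$
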

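We can proceed as before with the Gauss–Legendre quadrature. For the Laguerre polynomials,
we have the recurrence

$$(n+1)\,\varphi^{La}_{n+1} = (2n+1-x)\,\varphi^{La}_{n} - n\,\varphi^{La}_{n-1},$$

or

$$\varphi^{La}_{n} = \underbrace{-\frac{1}{n}}_{\alpha_n} x\,\varphi^{La}_{n-1} + \underbrace{\frac{2n-1}{n}}_{\beta_n} \varphi^{La}_{n-1} - \underbrace{\frac{n-1}{n}}_{\gamma_n} \varphi^{La}_{n-2}.$$

From these we obtain $\delta_n = 2n-1$ and $\eta_n = n$.
For the Hermite polynomials, we have the relation

$$\varphi^{H}_{n+1} = 2x\,\varphi^{H}_{n} - 2n\,\varphi^{H}_{n-1},$$

and so

$$\varphi^{H}_{n} = \underbrace{2}_{\alpha_n} x\,\varphi^{H}_{n-1} - \underbrace{2(n-1)}_{\gamma_n} \varphi^{H}_{n-2}.$$

Thus, we have $\delta_n = 0$ and $\eta_n = \sqrt{n/2}$.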
We will need the integrals of the weight functions, $\int_0^\infty e^{-x}\,dx$ and $\int_{-\infty}^{\infty} e^{-x^2}\,dx$; they equal 1 and $\sqrt{\pi}$, respectively.

FIGURE 17.8 Absolute errors for Gauss–Legendre and Gauss–Laguerre compared with normcdf for an increasing number of nodes. Errors are plotted on the y axis; the x axis shows the number of nodes.

Listing 17.34: C-OptionCalibration/M/./Ch15/GLanodesweights.m


function [x,w] = GLanodesweights(n)
% GLanodesweights -- version 2010-10-24
% (G_auss La_guerre...)
delta = 2*(1:n)-1; eta = 1:(n-1);
A = diag(delta) + diag(eta,1) + diag(eta,-1);
[V,D] = eig(A);
x = diag(D);
% Matlab does not guarantee sorted eigenvalues
[x,i] = sort(x);
% weights: for Laguerre, integral from 0 to infty = 1
w = V(1,i) .^ 2;


Listing 17.35: C-OptionCalibration/M/./Ch15/GHnodesweights.m 


function [x,w] = GHnodesweights(n)
% GHnodesweights.m -- version 2010-11-07
% (G_auss H_ermite...)
eta = sqrt((1:(n-1)) / 2);
A = diag(eta,1) + diag(eta,-1);
[V,D] = eig(A);
x = diag(D);
% Matlab does not guarantee sorted eigenvalues
[x,i] = sort(x);
% weights: for Hermite, integral from -inf to inf = sqrt of pi
w = sqrt(pi) * V(1,i) .^ 2;
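
A quick way to check both rules is to integrate a polynomial against the respective weight function, for which the exact value is known. The sketch below repeats the eigenvalue computations of the two listings inline so that it runs on its own; the choice of n = 10 nodes and of the test integrand x^2 are assumptions made for illustration.

```matlab
% Check Gauss-Laguerre and Gauss-Hermite nodes/weights on known integrals.
n = 10;                                    % number of nodes (illustrative)

% Gauss-Laguerre: weight function exp(-x) on [0, inf)
delta = 2*(1:n)-1; eta = 1:(n-1);
[V,D] = eig(diag(delta) + diag(eta,1) + diag(eta,-1));
[xLa,i] = sort(diag(D));
wLa = V(1,i).^2;

% Gauss-Hermite: weight function exp(-x^2) on (-inf, inf)
eta = sqrt((1:(n-1))/2);
[V,D] = eig(diag(eta,1) + diag(eta,-1));
[xH,i] = sort(diag(D));
wH = sqrt(pi) * V(1,i).^2;

% integral of x^2 * exp(-x) over [0, inf) is 2
errLa = abs(wLa * xLa.^2 - 2)
% integral of x^2 * exp(-x^2) over (-inf, inf) is sqrt(pi)/2
errH  = abs(wH * xH.^2 - sqrt(pi)/2)
```

Both errors should be at the level of rounding error, since a rule with n nodes integrates polynomials up to degree 2n - 1 exactly against its weight function. For a general integrand f without the weight, the function values are multiplied by exp(x) or exp(x^2) at the nodes, as in the formulas above.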


We can now repeat Example 17.3. 


Listing 17.36: C-OptionCalibration/M/./Ch15/integrateGauss2.m 


%% integrateGauss2.m -- version 2010-10-24
%% (also tested with Octave 4.2.2)

% goal: to compute N(b)
b = 2;
% number of nodes
n = 20;
% replace minus infinity by
lowerLim = -10;

% compute nodes/weights
[x,w] = GLnodesweights(n);
% change interval of integration
[x,w] = changeInterval(x, w, -1, 1, lowerLim, b);

[x2,w2] = GLanodesweights(n);
% result of integration
ourResultGLegendre = w*GaussF(x)
ourResultGLaguerre = w2 * (exp(x2).* GaussF(-(x2-b)))

% result of normcdf from Statistics Toolbox
MatlabResult = normcdf(b)

abs(ourResultGLegendre-MatlabResult)
abs(ourResultGLaguerre-MatlabResult)


Fig. 17.8 compares the absolute error of Gauss–Laguerre with the more straightforward Gauss–Legendre (with −∞ replaced by −10). We see that with more than, say, 10 nodes, the error for the
latter approach is actually smaller.


Appendix A 


The NMOF package 


A.1 Installing the package 


The NMOF package is available from CRAN. The easiest way to get it is to install directly from 
within R: 


> install.packages("NMOF")

The package on CRAN is updated once or twice per year. If you want to use the latest version, you
can get it from the maintainer’s website:
http://enricoschumann.net/R/packages/NMOF/ 

You may also directly install from that site: 

> install.packages('NMOF',
                   repos = c('http://enricoschumann.net/R',
                             getOption('repos')))


The latest version of the package is also mirrored to GitHub and GitLab: 
https://github.com/enricoschumann/NMOF 
https://gitlab.com/enricoschumann/NMOF 


To see what is new in the package, check the NEWS file, which you also may do from within R. 


> news (Version >= "1.0-0", package = "NMOF") 


A.2 News, feedback and discussion 

New versions of the package and other news are announced through the NMOF-news mailing list.

To browse the archives or to subscribe, go to 
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/nmof-news 

An RSS feed of the package NEWS file is available at 
http://enricoschumann.net/R/packages/NMOF/NMOF_news.xml 


Applications, as long as they are finance-related, should be discussed on the R-SIG-Finance
mailing list. To browse the archives or to subscribe, go to
https://stat.ethz.ch/mailman/listinfo/r-sig-finance


Please send bug reports or suggestions directly to the package maintainer, for instance by using 
bug.report. 


> require("utils") 
> bug.report("[NMOF] Unexpected behavior in function XXX", 
maintainer ("NMOF"), package = "NMOF") 


A.3 Using the package 


You can directly access all the R scripts that are displayed in the book with the function showExample. For instance:

> library("NMOF")
> showExample("exampleOF.R") ## first edition
> showExample("exampleLS.R", chapter = 13) ## first edition


There are also many other code examples in the book, notably in the vignettes. 


> vignette(package = "NMOF") 


You can directly access the code in a vignette. 


> nss <- vignette("DEnss", package = "NMOF")
> print(nss) ## show PDF
> edit(nss) ## show code


Many more examples are in the NMOF manual:
http://enricoschumann.net/NMOF.htm#NMOFmanual